GrateTile: Efficient Sparse Tensor Tiling for CNN Processing by Lin, Yu-Sheng et al.
Yu-Sheng Lin
Inventec Corporation
lin.john@inventec.com
Hung-Chang Lu
National Taiwan University
hclu@media.ee.ntu.edu.tw
Yang-Bin Tsao
National Taiwan University
yangbin@media.ee.ntu.edu.tw
Yi-Min Chih
National Taiwan University
yiminchi@media.ee.ntu.edu.tw
Wei-Chao Chen
Inventec Corporation
chen.wei-chao@inventec.com
Shao-Yi Chien
National Taiwan University
sychien@media.ee.ntu.edu.tw
Abstract—We propose GrateTile, an efficient, hardware-
friendly data storage scheme for sparse CNN feature maps
(activations). It divides data into uneven-sized subtensors and,
with small indexing overhead, stores them in a compressed yet
randomly accessible format. This design enables modern CNN
accelerators to fetch and decompress subtensors on-the-fly in
a tiled processing manner. GrateTile is suitable for architectures
that favor aligned, coalesced data access, and only requires
minimal changes to the overall architectural design. We simulate
GrateTile with state-of-the-art CNNs and show an average of
55% DRAM bandwidth reduction while using only 0.6% of
feature map size for indexing storage.
Index Terms—Neural Network Hardware, Data Compression,
Sparse Matrix.
I. INTRODUCTION
Convolutional neural networks (CNNs) are now considered
one of the most widely used machine learning techniques in
computer vision and image processing [1]–[5]. Their primary
operation is the convolution between kernels (weights) and
feature maps (activations), which can consume lots of power
through MAC operations and memory accesses. To alleviate
this problem, one can take advantage of the redundancies in
the feature maps and skip unnecessary processing with sparse
computation. For example, neural networks using the ReLU
activation function may have highly sparse feature maps with
up to 80% zero values clipped from negative values. It is
also possible to fine-tune the network to generate kernels with
higher sparsity [6], [7], so that CNN accelerators can reduce
operation waste by gating the processing elements (PEs) and
avoiding the scheduling of zero operations [6], [8], [9].
Compared with the energy wasted through redundant op-
erations, data access power is arguably more critical for
future accelerator designs because memory bandwidth has
been growing slower than the speed of PEs [10]. That is,
an algorithm can become increasingly memory bound for
future architectures. Newer networks tend to adopt smaller
convolution kernels with deeper layers, which further reduces
operation count at the cost of increased memory usage. In
Fig. 1, we calculate the power consumption breakdown accord-
ing to [11] by simulating several popular CNNs with SCALE-
sim on a 16 × 16 systolic array [12]–[14]. Notice that the
percentage of MAC power decreases from 35% in 2012 to 15%
[Figure 1: stacked bar chart of power breakdown (0% to 100%) for AlexNet (2012), VGG 16 (2014), GoogleNet (2014), VDSR (2016), and ResNet 50 (2016); components: DRAM Feature Read, DRAM Kernel Read, DRAM Write, SRAM, PE.]
Fig. 1. Power breakdown of popular CNN applications using SCALE-
sim. According to this simulation, the DRAM feature read is the primary
power draw for CNNs.
in 2016, while the DRAM feature read consistently consumes
over half of the remaining power. Modern CNN accelerators
have already utilized on-chip SRAM to effectively reduce data
access power (Fig. 2a). To push the envelope further, we could
compress the feature maps, but the compression scheme may
not be compatible with the tiled processing nature of modern
CNN accelerators (Fig. 2b). A better approach is to divide
the feature maps into independently compressed subtensors to
make them randomly-accessible for tiled processing (Fig. 2c).
This design principle allows the memory controller to fetch
only the required subtensors and assemble them into tiles
on-the-fly, without wasting bandwidth on over-fetching data
outside of the tiles.
After analyzing how CNN architectures divide, compress,
and store the feature maps [15], [16], we observe that the divi-
sion and storage process has an equally if not more substantial
impact on the overall DRAM bandwidth when compared
with the compression algorithms for individual subtensors.
Fig. 3 illustrates the trade-off between using a larger subtensor
size, which causes wasted fetch (Fig. 3a), and a smaller
subtensor size which causes data fragmentation (Fig. 3b). To
break free from this trade-off, we propose GrateTile, which
divides the feature maps into uneven sizes for optimal CNN
processing (Fig. 3c). By inserting smaller subtensors between
larger tensors, GrateTile combines the storage efficiency of
larger subtensors without the over-fetching waste. By adding
GrateTile functionality to existing CNN accelerators, we can
gain approximately 55% bandwidth improvement over the un-
compressed baseline, and 6-27% bandwidth improvement over
compressed tiles according to our simulation. In summary, our
contributions are:
• A bandwidth-efficient storage scheme for sparse feature
arXiv:2009.08685v1 [cs.LG] 18 Sep 2020
[Figure 2 diagrams: data path from DRAM (off-chip) through SRAM to the PE array, with a compressor and decompressor added in panels (b) and (c). Panel (a): uncompressed DRAM read bandwidth, larger power; SRAM read of the cached feature map tile, smaller power and higher bandwidth. Panel (b): compressed DRAM read bandwidth, smaller power, low bandwidth with potential over-fetching waste. Panel (c): feature map divided into subtensors, compressed DRAM read bandwidth, smaller power, lowest bandwidth with no over-fetching.]
(a) Tiled CNN processing without compression is inefficient.
(b) Compressed feature maps do not work well with tiling.
(c) Independently compressed subtensors are tiling-friendly.
Fig. 2. Reducing DRAM bandwidth via compression. While it can be more
power-efficient to compress feature maps in DRAM, traditional compression
algorithms tend to be incompatible with tiled processing. Dividing feature
maps into subtensors and compressing them independently is an effective method
for this purpose.
maps with CNN accelerator-friendly memory access pat-
terns,
• A universal methodology to convert a sparse tensor into
the GrateTile packing given CNN layer and accelerator
configurations, and
• A method for integrating GrateTile into existing acceler-
ators with small hardware modification and overhead.
II. RELATED WORKS
While GrateTile focuses on exploiting feature map sparsity
that tends to be dynamic, kernel sparsity tends to be more
static. Researchers have had great success exploiting this
attribute to reduce DRAM bandwidth and power consumption
for CNNs. These methods operate by dropping small kernel
values followed by retraining to compensate for accuracy loss.
Many of these works also focus on designing PEs that can
skip unnecessary MAC or math operations to save power.
EIE [6] is a fully-connected layer accelerator, which uses
two indices for the next non-zero values in the feature maps
and kernels so that it only performs operations with non-
zero operands. Several CNN accelerators also apply similar
methods to skip unnecessary operations [8], [17]. This process-
ing flow forces serialization of operation for different kernel
values and therefore limits the parallelism. SCNN decomposes
a matrix multiplication or a CNN into several outer product
operations [9] and is thus an outer product PE array. Since
the outer product is zero when either kernel or feature map
is zero, it compacts the input to ensure there is no wasted
operation. However, since the output address from each PE is
different, this causes irregular address calculation and results
[Figure 3 diagrams, each showing the subtensor division next to the actual memory layout under memory alignment (cache line, bus width, ...). Panel (a) is annotated "Waste: unused subtensors" around the halo. Panel (b) is annotated "Waste: memory misalignment", with an alternative memory layout annotated "Large pointers!". Panel (c) is annotated "No waste!" and "Small pointers!".]
(a) Divide feature map into large subtensors.
(b) Divide feature map into small subtensors.
(c) GrateTile feature map division.
Fig. 3. A comparison of feature map division methodologies. The use of
large subtensor sizes [15] can cause partial subtensor accesses. On the other
hand, small subtensor sizes [16] can cause data fragmentation. Both cases
lead to bandwidth wastes. GrateTile uses hybrid subtensor sizes aligned to
the CNN data fetching pattern, preventing both types of problems at the same
time.
(a) Bitmask compression: 0,0,a,0,0,0,0,b → 00100001 + a,b.
(b) ZRLC compression: 0,0,a,0,0,0,0,b → (2,a),(4,b).
Fig. 4. Two popular methods for CNN feature map compression. Bitmask
and ZRLC are widely used algorithms in CNN accelerators due to their
simplicity for hardware implementation.
in a distribution unit that is three times larger than the PE
array.
The methods above assume the zero values can appear
randomly. On the other hand, some researchers believe the
sparsity can be structural, and subtensors of kernel tensors
may repeatedly appear in different positions. CirCNN and
PermDNN assume any row in the kernel tensor is a rotation
of its neighboring row [18], [19]. These repeating structures
of kernel tensors enable many methods for reducing operation
count, such as replacing multiplication with table lookup.
Wu et al. use vector quantization to cluster kernels with k-means
and replace them with their cluster center indices to save
memory [20].
III. GRATETILE FOR SPARSE FEATURE MAPS
A. The Need for Hardware Aligned Storage
As discussed before in Fig. 3, to support tiled CNN process-
ing, we need to divide the feature maps into subtensors and
[Figure 5 diagrams: (a) Processing the first 8 × 8 CNN block, with the input window spanning −1 to 9. (b) Processing the second 8 × 8 CNN block, with the input window spanning 7 to 17. (c) GrateTile for the input feature map: left boundaries −1, 7, 15, ... = 7 (mod 8); right boundaries 9, 17, 25, ... = 1 (mod 8).]
Fig. 5. Deriving the GrateTile Configuration. The proposed methodology divides the feature map tensor along all possible boundaries of the accessed
tensors using modular arithmetic.
[Figure 6 diagrams: (a) input window spanning −k to (tw − 1)s + k + 1 for an output tile of width tw and kernel size 2k + 1; (b) input window spanning −kd to (tw − 1)s + kd + 1 with dilation d.]
Fig. 6. Deriving GrateTile configuration for CNN layers. (a) A standard
CNN with (k, s, d, tw) = (1, 2, 1, 6) and (b) a dilated CNN with
(k, s, d, tw) = (1, 1, 2, 6).
compress them independently. Many architectures adopt sim-
ple compression algorithms such as bitmask or zero run-length
coding (ZRLC) [8], [15]–[17] with uniform division scheme
for the tensors (Fig. 4). While suitable for hardware imple-
mentation, this can lead to a waste of memory bandwidth.
Even though the convolution operation only requires data from
surrounding pixels (i.e., halo), we may end up fetching the
entire neighboring subtensors because the compressed blocks
are not randomly accessible (Fig. 3a). Furthermore, modern
memory hierarchies, like DRAM or cache, favor aligned and
coalesced access, and the variable size of compressed data can
result in fragmentation and wasted bandwidth. We can reduce
fragmentation with index memory (Fig. 3b), but this pointer
index can be too big for the on-chip SRAM, or contribute to
additional latency and bandwidth if stored in the DRAM.
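The two compression schemes of Fig. 4 can be sketched in a few lines of Python. This is an illustrative software model only, not the hardware implementation of any cited accelerator. The function names are hypothetical, and the ZRLC variant below assumes unbounded run lengths and reports trailing zeros separately, which a fixed-width hardware encoder would handle differently.

```python
def bitmask_compress(values):
    """Return (mask, nonzeros): one mask bit per element plus the packed non-zero values."""
    mask = [1 if v != 0 else 0 for v in values]
    nonzeros = [v for v in values if v != 0]
    return mask, nonzeros


def bitmask_decompress(mask, nonzeros):
    """Rebuild the dense sequence by consuming one stored value per set mask bit."""
    stored = iter(nonzeros)
    return [next(stored) if bit else 0 for bit in mask]


def zrlc_compress(values):
    """Return ([(zero_run, value), ...], trailing_zeros)."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs, run


def zrlc_decompress(pairs, trailing_zeros):
    """Expand each (zero_run, value) pair, then append the trailing zeros."""
    dense = []
    for run, v in pairs:
        dense.extend([0] * run)
        dense.append(v)
    dense.extend([0] * trailing_zeros)
    return dense


# The sequence of Fig. 4, with a = 5 and b = 7 chosen arbitrarily.
row = [0, 0, 5, 0, 0, 0, 0, 7]
mask, nonzeros = bitmask_compress(row)   # mask 00100001, values 5 and 7
pairs, tail = zrlc_compress(row)         # (2, 5), (4, 7), no trailing zeros
```

In both schemes the compressed size depends on the data, which is why the stored subtensors have variable length and need either padding to an aligned size or an index.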
Fig. 3c shows how GrateTile can eliminate both the afore-
mentioned problems. By unevenly dividing the subtensors, we
have fewer subtensors and smaller index memory sizes, while
ensuring proper boundary alignment for tiled processing. Next,
we explain the methodology for finding an optimal division
given a CNN and the hardware configuration.
B. Computing GrateTile Configuration
Consider a CNN architecture processing a 3 × 3 convo-
lution on 4 input channels, using an 8 × 8 tile size for the
output feature map (Fig. 5). Our goal is to create a feature
map division that (1) avoids partial accesses to compressed
subtensors, and (2) minimizes the number of subtensors to
reduce data fragmentation. In this example, to compute the first
output tile, we need to fetch a 10×10×4 input tile (Fig. 5a).
When processing the next output tile, we would step toward
the right by 8 elements on the feature map (Fig. 5b) to fetch
the next input tile. Since the step size is constant within one
layer of CNN processing, the left (orange) and right (cyan)
access boundaries form two arithmetic progressions, denoted
as Bl = {−1, 7, 15 · · · } and Br = {9, 17, 25 · · · }. The
GrateTile configuration is simply the divisions formed by
both boundaries, namely the union G = Br ∪ Bl, or simply
G = {1, 7} (mod 8), as shown in Fig. 5c. Because 7−1 = 6
and 1− 7 = 2 (mod 8), each spatial dimension of the feature
map is divided into two uneven sizes of 2 and 6, which results
in four subtensor shapes—6 × 6, 2 × 6, 6 × 2, and 2 × 2. A
10 × 10 window is then composed of one 6 × 6, two 2 × 6,
two 6 × 2, and four 2 × 2 subtensors. Also, since the halo only
appears in the spatial dimension, this division process is not
necessary along the channel dimension.
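The worked example above can be reproduced with a short script. This is a minimal sketch for one spatial axis; the variable names and the number of enumerated tile positions are arbitrary choices made for illustration.

```python
TILE = 8      # output tile width; also the step between input tiles, since the stride is 1
HALO = 1      # k = 1 for a 3 x 3 kernel
N_TILES = 4   # how many tile positions to enumerate (arbitrary)

left = [i * TILE - HALO for i in range(N_TILES)]           # -1, 7, 15, 23
right = [i * TILE + TILE + HALO for i in range(N_TILES)]   # 9, 17, 25, 33

# All access boundaries collapse to two residues modulo the step size.
G = sorted({b % TILE for b in left + right})

# Cyclic gaps between consecutive residues give the subtensor side lengths.
sides = [(G[(i + 1) % len(G)] - g) % TILE for i, g in enumerate(G)]

# The first window spans [-1, 9); cut it at every position congruent to a residue in G.
lo, hi = left[0], right[0]
cuts = [x for x in range(lo, hi + 1) if x % TILE in G]
segments = [b - a for a, b in zip(cuts, cuts[1:])]

# Applying the same cuts on both spatial axes gives the subtensors of one window.
shapes = [(h, w) for h in segments for w in segments]

print(G, sides, segments, len(shapes))
```

The window is cut into segments of 2, 6, and 2 along each axis, so it is covered by nine whole subtensors with no partially used subtensor.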
We now generalize this example for all modern CNN
layers, whose computations can be defined by the following three
parameters:
• Kernel size—denoted as 2k + 1 since kernel sizes tend
to be odd integers.
• Stride, denoted as s: the offset between the input windows
of two neighboring output elements. When s > 1, the output
feature map is smaller and the computation cost is lower.
• Dilation, denoted as d: a dilated CNN [21] convolves input
elements spaced d apart for one output element to enlarge
the equivalent window size.
In addition, we denote the output tile size as th × tw. Fig. 6a
shows a CNN with d = 1. To compute the leftmost output
tile, we fetch from the feature map a window spanning from
the left boundary −k to the right boundary (tw − 1)s + k + 1.
Since the offset between two neighboring input tiles is stw,
we can define the GrateTile configuration as follows
$$
G = \{-k,\ (t_w - 1)s + k + 1\} \pmod{s t_w} = \{-k,\ k - s + 1\} \pmod{s t_w}. \tag{1}
$$
For the dilated CNN shown in Fig. 6b, a similar process yields
G = {−kd, kd− s+ 1} (mod stw).
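Both formulas can be evaluated with one small helper. This is a sketch under the definitions above; the function name `gratetile_config` is hypothetical, and setting d = 1 recovers the non-dilated case.

```python
def gratetile_config(k, s, d, tw):
    """Boundary residues {-k*d, k*d - s + 1} modulo s*tw, as a sorted list."""
    period = s * tw
    return sorted({(-k * d) % period, (k * d - s + 1) % period})


print(gratetile_config(k=1, s=1, d=1, tw=8))   # 3 x 3 kernel, stride 1: [1, 7]
print(gratetile_config(k=1, s=2, d=1, tw=6))   # parameters of Fig. 6a: [0, 11]
print(gratetile_config(k=1, s=1, d=2, tw=6))   # parameters of Fig. 6b: [2, 4]
```

The second residue is the right boundary (tw − 1)s + kd + 1 reduced modulo stw, so the simplified and unsimplified forms give the same set.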
An interesting property of GrateTile is that any configuration
for mod N is also valid for mod N′ if N′ | N. For
example, consider AlexNet CONV1 with (k, s, tw) =
(5, 4, 8): its GrateTile configuration is G = {27, 2} (mod 32),
but G = {3, 2} (mod 8) is also a valid GrateTile configuration.
In the extreme case, the GrateTile degenerates to Fig. 2c
when N′ = 1. It is thus possible to use a single N across
all CNN layers to keep the hardware implementation simple,
and we show that N = 8 can be a suitable choice for most
purposes in Section IV.
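The divisor property can be checked numerically. The sketch below uses a hypothetical helper, `reduce_config`, and confirms that every boundary of the AlexNet CONV1 configuration is still a boundary after reducing the modulus from 32 to 8.

```python
def reduce_config(residues, n, n_small):
    """Re-express a configuration given modulo n as one modulo n_small, where n_small divides n."""
    if n % n_small != 0:
        raise ValueError("n_small must divide n")
    return sorted({r % n_small for r in residues})


alexnet_conv1 = [27, 2]                      # G (mod 32) from the text
mod8 = reduce_config(alexnet_conv1, 32, 8)   # [2, 3]
mod1 = reduce_config(alexnet_conv1, 32, 1)   # [0]: every position is a boundary

# Every position that was a boundary modulo 32 remains a boundary modulo 8.
still_aligned = all(
    x % 8 in mod8 for x in range(-64, 128) if x % 32 in alexnet_conv1
)
print(mod8, mod1, still_aligned)
```

Reducing the modulus only adds boundaries, so the window edges stay aligned at the cost of more, smaller subtensors.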
C. Memory Layout for Compressed Subtensors
Given a GrateTile configuration, we need to store these
subtensors in a data structure that complies with the memory
alignment requirement to maximize the benefits of compres-
sion. Since subtensors can have different compressed sizes,
we have to store the extra metadata (e.g., pointers in Fig. 3)
separately from the compressed subtensors. Such metadata are
usually too large to fit into the SRAM. For example, the
size of metadata would be 72 kB for AlexNet CONV2 if
each subtensor contains 8 words and requires a 32-bit pointer.
Therefore, a more reasonable choice for metadata storage
would be in the DRAM. However, we must be careful since
fetching them would cause extra bandwidth and access latency.
Fig. 7b shows how we store the subtensors and meta-
data. Since GrateTile is a near-uniform subtensor division
methodology, it is relatively straightforward to extend the data
structure from uniform division (Fig. 7a) for our purposes.
For example, a GrateTile configuration G = {1, 7} (mod 8)
is equivalent to dividing every 8 × 8 subtensor further into four
smaller subtensors, and therefore its metadata would extend
from the uniform division structure for size 8. With uniform
division, every subtensor has a pointer to the starting address
of the subtensor. We extend this structure by adding the
compressed sizes of the four smaller neighboring subtensors.
Thus, accessing these subtensors is a two-step procedure,
where we first locate the starting address from the pointer
and then add the subtensor sizes to get the actual offset for
each subtensor.
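The two-step lookup can be sketched as a running sum over the stored sizes. This is an illustrative model, assuming that the subtensors of one group are stored back to back in a fixed order and that the pointer and sizes are counted in 16-byte cache lines; the function name and the example numbers are hypothetical.

```python
CACHE_LINE_BYTES = 16  # alignment assumed in the text


def subtensor_addresses(pointer, sizes_in_lines):
    """Return the byte address of each subtensor in one group.

    pointer        -- start of the group, counted in cache lines
    sizes_in_lines -- compressed size of each subtensor in cache lines,
                      listed in storage order
    """
    addresses = []
    offset = pointer
    for size in sizes_in_lines:
        addresses.append(offset * CACHE_LINE_BYTES)   # step 1: base, step 2: add sizes so far
        offset += size
    return addresses


# Hypothetical group: starts at cache line 100 and holds four subtensors
# occupying 2, 5, 5 and 11 cache lines after compression.
print(subtensor_addresses(100, [2, 5, 5, 11]))   # [1600, 1632, 1712, 1792]
```

Because sizes are stored in whole cache lines, every computed address stays aligned without storing a full pointer per subtensor.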
We now calculate the size of the metadata as follows. As
shown in Fig. 7a, in a uniform subtensor division, we need
a pointer for each 8 × 8 × 8 = 512 words. Since GrateTile
only stores these subtensors in aligned addresses, given a 32-
bit addressing space with a 16-byte cache alignment, the size
of the pointer is 32 − log2 16 = 28 bits. We now extend this
to the GrateTile division and represent the size of each of the four
neighboring subtensors by the number of 16-byte cache lines
it uses. For this purpose, different GrateTile configurations
yield different subtensor sizes, which may require different
numbers of bits. To this end, we select the maximum number
among the GrateTile configurations supporting popular CNN
kernel sizes. For kernel sizes 3, 7 and 11, we have G = {1, 7}
(mod 8), and a 512-word uniform subtensor divides into four
subtensors of sizes 64, 192, 192, and 576 bytes, requiring
3+4+4+6 = 17 bits of metadata. For kernel sizes 5 and 9, we
have G = {2, 6} (mod 8), which requires 5+5+5+5 = 20
bits of metadata. Therefore, for every 512 words of feature
map stored in GrateTile format, we need 28+20 = 48 bits of
metadata, which represents only 0.6% of overhead.
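The bit counts above can be re-derived as follows. This is a sketch under two assumptions that reproduce the stated numbers: each word is 16 bits (8 words per 128-bit alignment unit), and each size field must represent any count from 0 up to the uncompressed size of its subtensor in cache lines. All names are hypothetical.

```python
import math

WORD_BYTES = 2          # assumed: 8 words = 128 bits, so one word is 16 bits
CACHE_LINE_BYTES = 16
CHANNELS = 8
ADDRESS_BITS = 32


def size_field_bits(sides):
    """Bits needed for the size field of each of the four subtensors in one 8 x 8 x 8 group."""
    bits = []
    for h in sides:
        for w in sides:
            max_lines = h * w * CHANNELS * WORD_BYTES // CACHE_LINE_BYTES
            bits.append(max_lines.bit_length())   # enough to hold 0..max_lines inclusive
    return bits


pointer_bits = ADDRESS_BITS - int(math.log2(CACHE_LINE_BYTES))
bits_2_6 = size_field_bits([2, 6])   # 8 = 2 + 6, as for G = {1, 7} (mod 8)
bits_4_4 = size_field_bits([4, 4])   # 8 = 4 + 4, as for G = {2, 6} (mod 8)

size_bits = max(sum(bits_2_6), sum(bits_4_4))
total_bits = pointer_bits + size_bits
group_bits = 8 * 8 * CHANNELS * WORD_BYTES * 8    # 512 words of 16 bits
overhead = total_bits / group_bits

print(bits_2_6, bits_4_4, total_bits, f"{overhead:.2%}")
```

Taking the maximum over the supported configurations lets one fixed metadata layout serve every layer.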
IV. EVALUATIONS
[Figure 7 diagrams: (a) uniform division, where each subtensor in the memory layout is located by an address pointer; (b) GrateTile division, where each group of subtensors in the memory layout is located by an address pointer followed by the subtensor sizes.]
(a) Uniform division
(b) GrateTile division
Fig. 7. The GrateTile data structure. (a) With uniform division, subtensors
align with cache lines, with pointers used to locate the starting addresses. (b)
The GrateTile is a near-uniform division; therefore, we extend the uniform
division structure by adding the size information for the smaller neighboring
subtensors.
[Figure 8 bar chart: bandwidth saved (higher is better), from 0% to 60%, for the small tile (NVIDIA) and large tile (Eyeriss) configurations; series: Optimal, GrateTile (mod 8), Uniform 8 × 8 × 8, Uniform 4 × 4 × 8, Uniform 2 × 2 × 8, Uniform 1 × 1 × 8.]
Fig. 8. Overall bandwidth reduction. GrateTile provides the best overall
bandwidth reduction compared to uniform division on different hardware
platforms. Here, the optimal bandwidth reduction ratio is defined by the ratio
of zero values in the feature map.
In this section, we discuss the bandwidth reduction with
GrateTile in sparse CNN processing compared with several
uniform division methods used by other CNN accelerators
[15], [16]. We simulate memory fetch patterns of
representative layers from popular CNN networks [1]–[4]:
• AlexNet: All layers, except for the first input layer since
it takes dense input images.
• VGG 16: The layers right before the pooling layers.
• ResNet 18: The layers right after the pooling layers.
• ResNet 50: The downsampling CNN layers and the layers
before them.
• VDSR: Every fourth layer of VDSR, since it consists of
18 layers of the same shape.
Fig. 8 illustrates the geometric mean of bandwidth savings
from these benchmarks. We use the bitmask compression and
the mod 8 GrateTile configuration for this experiment; we
shall discuss the logic behind the selection of this number
later. Note that GrateTile saves an average of 55% of feature map
access bandwidth, which represents 6-27% more savings
than uniform subtensor division. We discuss the details of how
we arrive at these results in the remainder of this section, as
well as insights from our experiments.
A. Experiment Setup
We perform our simulation on two types of hardware platforms
that are characteristic of CNN architectures, namely,
an NVIDIA GPU and the Eyeriss architecture. We assume
the memory alignment size is 8 words (128 bits), which is
in line with the AXI bus width of [15]; NVIDIA GPUs also
adopt a similar alignment configuration, which is 8 floating-point
numbers (256 bits) per L1 cache line. To determine the
TABLE I
GRATETILE CONFIGURATIONS USED IN OUR EXPERIMENTS.
CNN type (kernel, stride) | Tile size modeled after NVIDIA | Tile size modeled after Eyeriss | GrateTile configuration
(3, 1) | 10x18x8 | 18x18x16 | G = {1, 7} (mod 8)
(3, 2) | 9x17x8 | 17x17x16 | G = {0, 7} (mod 8)
(5, 1) | 12x20x8 | 20x20x16 | G = {2, 6} (mod 8)
TABLE II
THE FEATURE MAP METADATA OVERHEAD.
Feature map subdivision mode | Metadata bits per KB of feature map | Percentage
GrateTile (mod 4) | (28 + 20) × 4 = 192 | 2.36%
GrateTile (mod 8) | 28 + 20 = 48 | 0.59%
GrateTile (mod 16) | (28 + 20) ÷ 4 = 12 | 0.15%
Uniform 8x8x8 | 28 | 0.34%
Uniform 4x4x8 | 28 × 4 = 112 | 1.37%
Uniform 2x2x8 | 28 × 16 = 448 | 5.47%
Uniform 1x1x8 | 32 × 64 = 2048 (a) | 25.0%
(a) The addresses are 32-bit since we compactly pack each
subtensor here.
maximum processing tile size, we must consider both double
buffering (prefetching) and convolutional kernels, and assume
a reasonable processing tile of less than one-fourth of the
buffer size. Therefore in the following experiments, for an
NVIDIA Volta architecture with 64 KB shared memory in
one of its processor arrays, we define the small tile (NVIDIA)
configuration to hold a 4K-word feature map subtensor. For
Eyeriss with a 108 KB global buffer, we define the large
tile (Eyeriss) configuration to hold 16K words.
Fig. 9 illustrates a bandwidth reduction breakdown of Fig. 8
for individual network layers. Table I shows the processing tile
size and GrateTile configuration for various CNN layers. We
compare GrateTile with uniform subtensor division schemes
ranging from 1 × 1 × 8 to 8 × 8 × 8, under both the small
and large tile configurations. All subtensors are aligned with
the cache lines except for the 1 × 1 × 8 division, where we
compactly pack the subtensors because each subtensor is too small
to fill up one cache line. Tables II and III show the bandwidth
overhead caused by fetching the metadata.
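The input tile sizes in Table I follow from the output tile size, kernel size, and stride. The sketch below is illustrative only: the output tile sizes are not stated in the text and are assumed values chosen so that the standard window formula reproduces Table I, and the function names are hypothetical.

```python
def input_tile(out_h, out_w, kernel, stride, channels):
    """Input tile (h, w, c) needed to produce an out_h x out_w output tile."""
    k = kernel // 2
    h = (out_h - 1) * stride + 2 * k + 1
    w = (out_w - 1) * stride + 2 * k + 1
    return h, w, channels


def words(shape):
    """Number of feature map words in a tile of the given shape."""
    h, w, c = shape
    return h * w * c


# (kernel, stride) -> ASSUMED output tile (h, w); these values are not given in the text.
nvidia_out = {(3, 1): (8, 16), (3, 2): (4, 8), (5, 1): (8, 16)}
eyeriss_out = {(3, 1): (16, 16), (3, 2): (8, 8), (5, 1): (16, 16)}

nvidia = {ks: input_tile(*hw, *ks, channels=8) for ks, hw in nvidia_out.items()}
eyeriss = {ks: input_tile(*hw, *ks, channels=16) for ks, hw in eyeriss_out.items()}

for ks in nvidia:
    print(ks, nvidia[ks], words(nvidia[ks]), eyeriss[ks], words(eyeriss[ks]))
```

Under these assumptions every input tile stays within the 4K-word and 16K-word budgets of the small and large tile configurations.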
B. Discussions
From these experiments, we can obtain several useful in-
sights:
(1) Tile size and bandwidth reduction. For uniform tensor
division, an optimal division size does not exist because it is
a trade-off between the partial tensor accesses and the data
indexing overhead. For example, the larger uniform 8× 8× 8
division can derive the most benefits by going with larger
processing tiles, resulting in a bandwidth improvement of 13%
(40.9% - 27.9%). In comparison, a smaller uniform division
like 2 × 2 × 8 does not derive similar benefit with larger
processing tile (40.2%-40.1% = 0.1%); it also consumes much
more metadata than the 8×8×8 division. Since GrateTile uses
a small number of subtensors to prevent partial tensor accesses,
it outperforms the best uniform division methods (4× 4× 8)
according to Fig. 8 and Table III.
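The partial-access side of this trade-off can be illustrated by counting how many elements must be fetched along one axis when a window can only be read as whole subtensors. This is a simplified sketch: it ignores compression, metadata, cache-line rounding, and feature map edges, and the helper name is hypothetical.

```python
def fetched_span(lo, hi, residues, period):
    """Elements fetched along one axis when the window [lo, hi) must be read as
    whole subtensors whose boundaries sit at the given residues (mod period)."""
    start = lo
    while start % period not in residues:
        start -= 1          # extend left to the nearest subtensor boundary
    end = hi
    while end % period not in residues:
        end += 1            # extend right to the nearest subtensor boundary
    return end - start


window = (-1, 9)   # first 10-wide input window of the 3 x 3 kernel, 8 x 8 tile example
uniform8 = fetched_span(*window, residues={0}, period=8)           # 24 elements
uniform2 = fetched_span(*window, residues={0, 2, 4, 6}, period=8)  # 12 elements
grate = fetched_span(*window, residues={1, 7}, period=8)           # 10 elements

print(uniform8, uniform2, grate)
```

With the same number of subtensors per axis as the uniform size-8 division, the GrateTile boundaries fetch exactly the 10 required elements, whereas the uniform size-8 division fetches 24.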
TABLE III
THE IMPACT OF METADATA ON BANDWIDTH REDUCTION.
Bandwidth saved (%)
Feature map division mode | Without overhead, NVIDIA | Without overhead, Eyeriss | With overhead, NVIDIA | With overhead, Eyeriss
GrateTile (mod 4) | 46.6 | 46.6 | 44.2 | 44.2
GrateTile (mod 8) | 54.7 | 54.9 | 54.1 | 54.3
GrateTile (mod 16) | 56.2 | — (a) | 56.0 | — (a)
Uniform 8x8x8 | 28.4 | 41.2 | 27.9 | 40.9
Uniform 4x4x8 | 45.0 | 49.5 | 43.6 | 48.1
Uniform 2x2x8 | 45.6 | 45.8 | 40.1 | 40.2
Uniform 1x1x8 | 56.5 | 56.7 | 30.7 | 30.9
(a) In the GrateTile (mod 16) subtensor division with the small tile
configuration (NVIDIA), a fetched tile is smaller than a subtensor,
so GrateTile is not applicable to this case.
(2) Metadata overhead. In Table II, we calculate the
metadata required for every 8×8×8 = 512-word feature map
and extrapolate the results to different division methods. In
Table III, we show the results with and without the bandwidth
overhead caused by accessing the metadata. Observe that
without the overhead, the bandwidth saving generally becomes
better as the uniform subtensor sizes get smaller. The only
exception is 2 × 2 × 8 for large tile configuration due to its
cache fragmentation (Fig. 3b). The compacted 1×1×8 division
can be considered as a performance upper bound since there are
neither partial subtensor nor partial cache line accesses, and GrateTile
mod 8 is only 1.8% worse than this upper bound. However,
the 1×1×8 division adds 24.4% metadata fetching overhead,
making it perform the worst compared with other division
methods.
(3) Limitations and GrateTile configuration. Because
GrateTile is best suited for tile-based CNN processing, adopting
GrateTile may lead to bandwidth overhead by creating
unnecessary subtensor divisions, for example, if an accelerator
processes a whole channel before the next channel. In this scenario,
a tile and a feature map have the same size in the spatial
dimensions, which happens in layers like AlexNet CONV5
or VGG 16 CONV5_3 where a uniform division subtensor
(16 × 16) can contain the whole input feature map (14 × 14).
For these layers, using GrateTile requires 4% more bandwidth
than not dividing the subtensor at all. It also explains why the
mod 16 GrateTile has slightly better performance (56.0%-
54.1% = 1.9%) than mod 8 in Table III. However, this
subtensor division does not work in the smaller tile hardware
configuration, which implies a large workspace requirement to
compress the subtensors. Therefore, we claim that the mod 8
GrateTile is a reasonable choice for most network layers and
hardware configurations.
V. CONCLUSIONS
We propose GrateTile, a hardware-friendly methodology
for storing and accessing compressed, sparse feature maps.
GrateTile divides feature maps into uneven subtensors, and
in the process, avoids wasteful fetches of partial subten-
sors and partial cache lines. Furthermore, it only requires
a small metadata indexing overhead to keep track of the
[Figure 9 bar charts: bandwidth saved (higher is better), from −20% to 80%, per CONV layer (kernel, stride) and network. Layers shown: AlexNet 2 (5,1), 3 (3,1), 4 (3,1), 5 (3,1); VGG 16 1_2, 2_2, 3_3, 4_3, 5_3 (all (3,1)); ResNet 18 2_2, 3_2, 4_2 (all (3,1)); ResNet 50 2_2 (3,2), 2_9 (3,1), 3_2 (3,2), 3_12 (3,1), 4_2 (3,2), 4_9 (3,1); VDSR 3, 7, 11, 15, 19 (all (3,1)). Series: Optimal, GrateTile (mod 8), Uniform 8 × 8 × 8, Uniform 4 × 4 × 8, Uniform 2 × 2 × 8, Uniform 1 × 1 × 8.]
(a) Bandwidth compression ratio in a small tile platform modeled after NVIDIA Volta.
(b) Bandwidth compression ratio in a large tile platform modeled after Eyeriss.
Fig. 9. Bandwidth reduction comparison using GrateTile and other subtensor division methods.
locations of the compressed subtensors. It can be a simple-
yet-effective modification for existing CNN accelerators since
it is mostly independent of the compression algorithms and
requires changes only to the existing feature map division
methods. Our experiments show that GrateTile can save an average of
55% of bandwidth compared with the uncompressed baseline and 6-27% more than
uniform subtensor division methods.
For hardware compression and decompression, our prelim-
inary SystemVerilog implementation shows promising area
efficiency compared to ZRLC, bitmask, and dictionary-based
algorithms, with better scalability and less serialization. We
will continue to investigate on this front and share our findings
with the community.
REFERENCES
[1] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution
using deep convolutional networks,” Transactions on Pattern Analysis
and Machine Intelligence, 2016.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural Infor-
mation Processing Systems 25, 2012.
[3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Conference on Computer Vision and Pattern
Recognition, 2016.
[4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in International Conference on Learning
Representations, 2015.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with con-
volutions,” in Conference on Computer Vision and Pattern Recognition,
2015.
[6] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
W. J. Dally, “EIE: efficient inference engine on compressed deep neural
network,” in International Symposium on Computer Architecture, 2016.
[7] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse
convolutional neural networks,” in Conference on Computer Vision and
Pattern Recognition, June 2015.
[8] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture
for energy-efficient dataflow for convolutional neural networks,” in
International Symposium on Computer Architecture, 2016.
[9] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan,
B. Khailany, J. Emer, S. W. Keckler, and W. Dally, “SCNN: An
accelerator for compressed-sparse convolutional neural networks,” 2017.
[10] W. A. Wulf and S. A. McKee, “Hitting the memory wall: implications
of the obvious,” ACM SIGARCH computer architecture news, 1995.
[11] M. Horowitz, “1.1 computing’s energy problem (and what we can do
about it),” in Digest of Technical Papers - IEEE International Solid-State
Circuits Conference, 2014.
[12] A. Samajdar, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna,
“SCALE-Sim: Systolic cnn accelerator simulator,” arXiv preprint
arXiv:1811.02883, 2018.
[13] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor
processing unit,” in International Symposium on Computer Architecture,
2017.
[14] S. Y. Kung, VLSI array processors, 1988.
[15] R. M. (EETimes). (2018) ARM gives glimpse of AI core. [Online].
Available: https://www.eetimes.com/document.asp?doc_id=1333307#
[16] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,” in
International Symposium on Microarchitecture, 2016.
[17] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and
A. Moshovos, “Cnvlutin: Ineffectual-neuron-free deep neural network
computing,” in International Symposium on Computer Architecture,
2016.
[18] C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian, and B. Yuan, “Permdnn:
Efficient compressed dnn architecture with permuted diagonal matrices,”
in International Symposium on Microarchitecture, 2018.
[19] Y. Cheng, F. X. Yu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F.
Chang, “An exploration of parameter redundancy in deep networks with
circulant projections,” in International Conference on Computer Vision,
2015.
[20] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional
neural networks for mobile devices,” in Conference on Computer Vision
and Pattern Recognition, 2016.
[21] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated
convolutions,” in International Conference on Learning Representations,
2016.
