Partitioning Compute Units in CNN Acceleration for Statistical Memory
  Traffic Shaping by Jung, Daejin et al.
Partitioning Compute Units in CNN Acceleration for
Statistical Memory Traffic Shaping
Daejin Jung, Sunjung Lee, Wonjong Rhee, Fellow, IEEE, and Jung Ho Ahn, Senior Member, IEEE
Department of Transdisciplinary Studies, Seoul National University
Abstract—Convolutional Neural Networks (CNNs) have become the default choice for processing visual information, and the design
complexity of CNNs has been steadily increasing to improve accuracy. To cope with the massive amount of computation needed for
such complex CNNs, the latest solutions utilize blocking of an image over the available dimensions (e.g., horizontal, vertical, channel,
and kernel) and batching of multiple input images to improve data reuse in the memory hierarchy. While there has been a large
collection of works on maximizing data reuse, only a few studies have focused on the memory bottleneck problem caused by limited
bandwidth. A bandwidth bottleneck can easily occur in CNN acceleration because CNN layers have different sizes with varying computation
needs and because batching is typically performed over each layer of the CNN for ideal data reuse. In this case, the data transfer demand for
a layer can be relatively low or high compared to the computation requirement of the layer, and the resulting temporal fluctuations in
memory access can eventually cause bandwidth problems. In this paper, we first show that there exists a high degree of
fluctuation in memory access to computation ratio depending on CNN layers and functions in the layer being processed by the
compute units (cores), where the compute units are tightly synchronized to maximize data reuse. Then we propose a strategy of
partitioning the compute units where the cores within each partition process a batch of input data in a synchronous manner to
maximize data reuse but different partitions run asynchronously. Because the partitions stay asynchronous and typically process
different CNN layers at any given moment, the memory access traffic sizes of the partitions become statistically shuffled. Thus, the
partitioning of compute units and their asynchronous use smooth the total memory access traffic over time, and
the degree of partitioning determines a tradeoff between data reuse efficiency and memory bandwidth utilization efficiency. We call this
smoothing statistical memory traffic shaping, and we show that it can lead to an 8.0% performance gain on a commercial 64-core
processor when running ResNet-50.
1 INTRODUCTION
The Convolutional Neural Network (CNN) [3] is one of
the most popular machine learning methods, especially for
image classification and object detection. A typical CNN has
a deep structure and multiple types of filters so that it can
model complicated functions, which requires high computational
power. As most CNN operations are parallelizable, a CNN fits
well into existing data-parallel architectures, such as GPGPUs [8],
manycore processors [11], FPGAs [1], and emerging deep-learning
accelerators [7].
These architectures (called CNN accelerators hereafter)
block each image over dimensions (e.g., horizontal, vertical,
channel, and kernel) and batch multiple images [9] to maxi-
mally exploit memory hierarchy where components closer to
compute units are smaller in capacity but offer higher band-
width and energy efficiency. A large body of work has ad-
dressed maximizing data reuse heuristically or systematically.
For instance, Yang et al. [16] proposed an optimal loop blocking
and reordering technique for convolution and fully-connected
layers to maximize data reuse in the memory hierarchy. Their
analytical model optimizes memory traffic on multi-level mem-
ory hierarchy at a given on-chip storage budget.
Modern CNN models (architectures) such as Inception-
v4 [13] and ResNet [4] tend to have a large number of layers. Be-
cause the layers have different designs with a varying number
of channels, number of kernels, and size of convolution filters,
the degree of data reuse also varies across the layers. The varia-
tion can be significant, leading to a severe temporal fluctuation
in bandwidth demands for different layers, especially for
off-chip main memory. Such a fluctuation is not a problem when
the memory bandwidth is sufficiently large, but that is not true
as we will show in Section 4. Furthermore, memory bandwidth
demand per accelerator is steadily increasing because more
transistors are becoming available due to the finer-pitch process
technology, and because circuit- and microarchitecture-level
optimizations (e.g., mixed-precision) to arithmetic units are
allowing more arithmetic units to be integrated [8].
©2017 IEEE. This is the author supplied version of the paper which appears
at IEEE Computer Architecture Letters (DOI: 10.1109/LCA.2017.2773055).
Providing sufficient main-memory bandwidth to accommodate
peak demands can be a solution. However, it is extremely
inefficient because increasing main-memory bandwidth incurs
area and energy overhead due to bulky I/Os and deteriorated
signal integrity, leading to a high cost premium [12].
Rather, it is desirable to shrink the gap between peak and average
main-memory bandwidth utilization by regulating bandwidth
demands from the numerous compute units. To the best of
our knowledge, previous studies have not focused on this
bandwidth bottleneck problem of the memory hierarchy, which is
caused by temporal fluctuations of resource demands.
In this paper, we first show that the gain of data reuse
by batching diminishes on the latest CNN models, because
they tend to have lean (smaller kernels and fewer neurons)
and deep (more layers) structures. On the other hand, the
synchronous nature of batching exacerbates the bandwidth
fluctuation issue. To address this problem, we propose to divide
compute cores into multiple partitions and make each partition
internally operate synchronously but make different partitions
operate asynchronously. This solution slightly sacrifices the
degree of data reuse, but its temporal bandwidth balancing
through statistical memory traffic shaping [15] will be shown
to outweigh the sacrifice.
2 DATA REUSE CHARACTERISTICS OF CNN
Modern data-parallel architectures provide increasing
computation performance for CNN processing. The latest NVIDIA
GPUs based on Pascal microarchitecture reach 10 TFLOPS for
single-precision operations per chip [2]. Intel’s Knights Landing
(KNL [11]) manycore CPUs offer 6 TFLOPS per chip with 72
x86 cores featuring AVX-512 SIMD support. As lower-precision
arithmetic is viable especially for inference, CNN accelerators
including GPUs, CPUs, and FPGAs are expected to populate
more functional units tailored to 16- and 8-bit operations,
further increasing compute capability. For example, Google’s
Tensor Processing Unit [7] supports 8-bit integer performance
of 92 TOPS through 64K matrix-multiply units.
arXiv:1806.06541v1 [cs.DC] 18 Jun 2018
[Figure 1 plot: Memory Bandwidth (GB/s), axis from 0 to 400, over Time
for the ResNet-50 layer groups Conv 1, Conv 2_1, Conv 4_6, and Conv 5_3;
the functions within the groups are labeled Conv (a, b, c), BN, Pooling,
Split, Residual, EWS, and FC.]
Fig. 1: Memory bandwidth utilization on ResNet-50 CNN layers over time [4].
Convolutional layers are interleaved with other filter types (e.g., batch
normalization (BN) and split functions), each exhibiting different
bandwidth demands.
[Figure 2 plot: Memory Access Ratio of Weights over Total Data, axis from
0 to 0.4, for AlexNet (2012), VGG (2014), GoogleNet (2014), and
ResNet (2015).]
Fig. 2: Memory access ratio of weights
over total data transfers of convolutional
and fully-connected layers.
Fetching data from a large memory such as an off-chip main
memory per arithmetic operation is costly in terms of energy
and bandwidth efficiencies. Therefore, it is critical to effectively
utilize a hierarchy of memory components (e.g., L1, L2 caches
or scratchpad memory) by exploiting locality. A convolutional
layer consists of K kernels, and each kernel receives C feature
maps (or input channels) passed from the previous layer,
conducts convolution, and produces an output feature map to
pass to the next layer. The K feature maps produced in this
layer are further processed by other filters, such as pooling,
rectified linear unit (ReLU), and batch normalization (BN). The
convolution filtering, which dominates CNN computation, can
be made highly parallel and provides abundant opportunities
for reusing the fetched weight data. For example, the number
of operations using a point in an input channel is proportional
to the number of kernels and their size whereas a weight in a
kernel is accessed for the number of times that is proportional to
the number of channels and the feature map size per channel.
Furthermore, a kernel weight can be reused further through
processing input images in batch (called batching [9] hereafter).
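The effect of batching on weight reuse can be illustrated with a back-of-the-envelope estimate. The sketch below is only an idealized model, not a measurement from this study: the function name `conv_layer_traffic`, the 4-byte element size, the stride-1 convolution with unchanged feature map size, and the assumption that each weight is fetched from main memory once per batch while each feature map is transferred once per image are all introduced here for illustration.

```python
def conv_layer_traffic(height, width, channels, kernels, filter_size,
                       batch, bytes_per_element=4):
    """Idealized traffic and work of one convolutional layer (hypothetical model).

    Assumes stride 1 with the output the same size as the input, weights
    fetched once per batch, and every input/output feature map moved to or
    from main memory once per image.
    """
    weight_bytes = kernels * channels * filter_size ** 2 * bytes_per_element
    fmap_bytes = batch * height * width * (channels + kernels) * bytes_per_element
    # One multiply and one add per weight per output pixel.
    flops = 2 * batch * height * width * kernels * channels * filter_size ** 2
    total_bytes = weight_bytes + fmap_bytes
    return {
        "weight_bytes": weight_bytes,
        "weight_share": weight_bytes / total_bytes,
        "flops_per_byte": flops / total_bytes,
    }


# Example shape: 28x28 feature maps, 128 input channels, 128 kernels of 3x3.
single = conv_layer_traffic(28, 28, 128, 128, 3, batch=1)
batched = conv_layer_traffic(28, 28, 128, 128, 3, batch=64)

for name, result in (("batch 1", single), ("batch 64", batched)):
    print(f"{name}: weight share {result['weight_share']:.3f}, "
          f"{result['flops_per_byte']:.1f} FLOP/byte")
```

Under these assumptions, a larger batch shrinks the share of traffic due to weights and raises the operations performed per byte, which is the reuse benefit that batching targets.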
By making sub-blocks of feature maps and kernel weights
and then by conducting computation on these sub-blocks, we
can achieve a high memory locality in reusing the weight data.
For example, the size of CNN weights often reaches tens to
hundreds of megabytes [10], far exceeding the on-chip buffer
size. Finding optimal blocking configurations maximizes data
reuse at a certain memory capacity leading to the highest ratio
of computation over accesses to the next level in the memory
hierarchy. Yang et al. [16] proposed a systematic approach
to find optimal loop blocking and reordering configurations
for the convolution and fully-connected layers. The MKL-DNN
library [5] we utilize in this study also exploits similar schemes.
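The idea of loop blocking can be sketched on a 1×1 convolution, which reduces to a matrix product over channels. The code below is a minimal illustration and is neither the search procedure of Yang et al. [16] nor the MKL-DNN implementation; the function names and the tile sizes `kernel_block` and `channel_block` are hypothetical. It processes one tile of weights at a time and sweeps all pixels over that tile, so a tile that fits in a near memory level is reused before the next tile is fetched.

```python
def conv1x1_reference(inputs, weights):
    """Unblocked 1x1 convolution: inputs[c][p] and weights[k][c] give out[k][p]."""
    num_channels, num_pixels = len(inputs), len(inputs[0])
    return [[sum(weights[k][c] * inputs[c][p] for c in range(num_channels))
             for p in range(num_pixels)]
            for k in range(len(weights))]


def conv1x1_blocked(inputs, weights, kernel_block, channel_block):
    """Same result, computed one (kernel_block x channel_block) weight tile at a time."""
    num_channels, num_pixels, num_kernels = len(inputs), len(inputs[0]), len(weights)
    out = [[0] * num_pixels for _ in range(num_kernels)]
    for k0 in range(0, num_kernels, kernel_block):
        for c0 in range(0, num_channels, channel_block):
            # The current weight tile is reused for every pixel before moving on.
            for p in range(num_pixels):
                for k in range(k0, min(k0 + kernel_block, num_kernels)):
                    acc = out[k][p]
                    for c in range(c0, min(c0 + channel_block, num_channels)):
                        acc += weights[k][c] * inputs[c][p]
                    out[k][p] = acc
    return out


# Small integer example: 6 channels, 5 pixels, 4 kernels (arbitrary test values).
inputs = [[(c + 1) * (p + 2) % 7 for p in range(5)] for c in range(6)]
weights = [[(k * 3 + c) % 5 - 2 for c in range(6)] for k in range(4)]
reference = conv1x1_reference(inputs, weights)
blocked = conv1x1_blocked(inputs, weights, kernel_block=2, channel_block=4)
print("blocked result matches reference:", blocked == reference)
```

Blocking changes only the order of the accumulations, so the output is identical; what changes is how long each weight tile stays in use between fetches.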
The latest CNN models have an increasing number of layers;
for example, ResNet-50 [4] has 50 weight layers, nearly all of
them convolutional. These layers dominate processing time and are
interleaved with the other aforementioned functions to improve
the recognition rate. These
convolutional layers have a wide variation in the sizes of
channels/kernels and the number of channels/kernels. The
computation to memory access ratio heavily depends on these
factors; for example, if all the weights of the kernels at a
certain convolutional layer fit in the last-level cache, they are
loaded just once while processing a batch of input images.
This manifests as a huge diversity in main-memory bandwidth
utilization over layers (hence over time) as depicted in Figure 1
and Table 1 (experimental setup is specified in Section 4),
but few studies have focused on the temporal fluctuation of
memory bandwidth demands and the resulting inefficiency.
Layer      Input size  # of input  Output size  Kernel     BW      FLOPS
           (H×V)       channels    (H×V)        (H×V, K)   (GB/s)
Pooling    112×112     64          56×56        3×3, -     254     0.6T
Conv2_1a   56×56       64          56×56        1×1, 64    174     2.9T
Conv2_2a   56×56       256         56×56        1×1, 64    120     3.0T
Conv3_2b   28×28       128         28×28        3×3, 128   55      3.7T
Conv4_3a   14×14       1024        14×14        1×1, 256   76      3.0T
Conv5_3b   7×7         512         7×7          3×3, 512   15      2.2T
TABLE 1: Memory bandwidth and TFLOPS of various layers on
ResNet-50 [4] (the abbreviations H, V, and K stand for horizontal,
vertical, and the number of kernels, respectively).
The conventional strategy of maximizing data reuse over
the memory hierarchy is still effective if a CNN accelerator is
equipped with main memory that can sustain the peak
bandwidth, because the memory then completely absorbs the
temporal fluctuation and performance is unaffected. However, increasing
peak main-memory bandwidth carries a high cost premium.
Contemporary accelerators exploit 3D stacking of memory and
better interface material such as silicon interposer [12] to
increase memory bandwidth, but the resulting bandwidth is still in the
hundreds of GB/s (e.g., 732 GB/s for an NVIDIA GPU with
HBM2 [2]), which is much lower than the peak bandwidth
demands from compute cores with half-precision performance
(over 20 TFLOPS). With insufficient main-memory bandwidth,
compute units such as cores would be underutilized, especially
during the early stages (layers) of CNN processing (Figure 1),
whereas memory stays idle while processing the later layers,
leading to suboptimal acceleration performance if all the cores
process the same layer together. Therefore, it is critical to devise
a solution that spreads memory requests more evenly over time
to reduce the gap between peak and average demands.
3 STATISTICAL MEMORY TRAFFIC SHAPING BY PARTI-
TIONING COMPUTE UNITS
[Figure 3 diagram: three panels plotting memory bandwidth (BW) over Time
while Cores 1 to 4 process layers L1 to L5: (a) Unlimited memory BW
(No Partition), (b) Limited memory BW (No Partition), and (c) Limited
memory BW (2 Partitions). Annotations: "Delay due to BW limitation" and
"Delay due to additional memory requests".]
Fig. 3: A simple illustration-purpose example of memory traffic
distribution over time. When peak memory bandwidth is (a) unlimited,
(b) limited, and (c) limited but using the proposed technique.
In this paper, we focus on alleviating the temporal fluctuation
of memory bandwidth demands on manycore-based CNN
accelerators. Similar observations and solutions can be applied
to other accelerator types supporting concurrent execution of
multiple contexts (e.g., NVIDIA Volta [8]). Cores in manycore
processors typically share higher levels of the memory
hierarchy (e.g., last-level caches or main memory). This enables 1)
certain data in a shared memory level to be used by multiple
cores (effect1) and 2) the cores with time-varying degree of
bandwidth and capacity demands to utilize the resource more
effectively (effect2). Previous studies in CNN acceleration
focused on exploiting effect1 to further enhance the degree
of data reuse. Yang et al. [16] compared the options of sharing
kernels or images, and advocated sharing kernels as it can better exploit
the producer-consumer locality of images among CNN layers.
Our reference implementation [5] also shares kernels among
cores; but instead of distributing partitioned images to the
cores, it allocates different images in a batch to different cores.
This, however, tightly couples a group of cores sharing
kernel weights. In CNN acceleration, layer processing is highly
sequential as a layer takes input channels that are produced
by its immediately previous layer. As more cores participate
in a group, compute to memory access ratio increases due to
a higher degree of data (weights) reuse, but the cores operate
highly synchronously as they process the same layer together.
In the extreme case where all cores in an accelerator compose
a single group, which is the configuration used for the data in Figure 1,
only a single layer is processed at any given time; therefore
effect2 cannot be well exploited and the huge variation in
main-memory bandwidth demands across layers can become a
serious problem.
If accessing kernel weights takes a large portion of main-
memory bandwidth utilization, it is more beneficial to max-
imize kernel weight reuse by increasing the core group size
even if it leads to more severe bandwidth fluctuation over time.
However, the impact of kernel weights on total memory band-
width diminishes as CNN models advance. Figure 2 shows
the ratio of kernel weights over total memory accesses for
the convolutional and fully-connected layers of the ImageNet
Large Scale Visual Recognition Challenge winners. The number
of layers increases, the size of convolution filters decreases, and
a layer often receives feature maps from multiple previously
calculated layers. These all contribute towards reducing the relative
portion of memory bandwidth demands due to kernel weights.
We exploit this trend of smaller impacts of kernel weights
on main-memory accesses by separating the compute cores
in a CNN accelerator into multiple partitions. Then we make
the cores in each partition process the assigned batch
synchronously, but we allow a partition to operate asynchronously
with respect to the other partitions. This slightly sacrifices the degree of
data reuse because kernel weights are not shared among multiple
partitions and hence must be loaded from main memory
per partition. However, its effect of better temporal bandwidth
balancing through statistical memory traffic shaping [15] can
outweigh the overhead.
Figure 3 depicts the impact of this statistical memory traffic
shaping on CNN accelerator performance with an illustration-
purpose example where memory bandwidth demands from
four cores vary depending on the layers they are processing.
With an unlimited bandwidth, the execution times of cores
are not affected by their main-memory bandwidth demands
(Figure 3(a)). For a realistic system with a limited bandwidth,
however, it takes much longer for the cores to execute layers
demanding more main-memory bandwidth (L1 and L3). When
the cores are not partitioned (Figure 3(b)), all four cores must
be synchronized at layer boundaries. When the cores are
divided into two partitions (Figure 3(c)), cores 3 and 4 can
execute different layers from cores 1 and 2 because the partitions
operate independently. Then, the memory bandwidth demands from
the cores can be distributed such that the aggregate bandwidth
demands are always below the peak bandwidth provided by
the accelerator. Even if there is additional memory traffic
due to a lower degree of data reuse in accessing kernel weights,
partitioning is beneficial as long as the performance overhead of
this extra traffic is smaller than the overhead due to the lack
of the memory traffic shaping effect.
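The smoothing effect can be reproduced with a toy model. The sketch below is illustrative only: the per-step demand values are hypothetical rather than measured, the partitions are assumed to be equally sized, a fixed offset stands in for the drift that asynchronous partitions develop on their own, and both the peak-bandwidth cap and the extra weight traffic caused by partitioning are ignored.

```python
from statistics import pstdev


def aggregate_demand(layer_demand, offsets):
    """Total bandwidth demand per time step for equally sized partitions.

    layer_demand[t] is the demand (GB/s) of the whole machine when all cores
    run time step t together; each partition contributes an equal share of
    it, shifted by its own offset into the repeating layer sequence.
    """
    steps = len(layer_demand)
    share = 1.0 / len(offsets)
    return [sum(share * layer_demand[(t + off) % steps] for off in offsets)
            for t in range(steps)]


# Hypothetical per-step demand of an unpartitioned run (GB/s), not measured data.
layer_demand = [300, 300, 100, 100, 50, 50, 250, 50]

no_partition = aggregate_demand(layer_demand, offsets=[0])
two_partitions = aggregate_demand(layer_demand, offsets=[0, 4])

for name, trace in (("no partition", no_partition),
                    ("2 partitions", two_partitions)):
    print(f"{name}: peak {max(trace):.0f} GB/s, "
          f"mean {sum(trace) / len(trace):.0f} GB/s, "
          f"SD {pstdev(trace):.1f} GB/s")
```

In this model the mean demand is unchanged while the peak and the standard deviation drop, because bandwidth-hungry steps of one partition overlap with lighter steps of the other.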
4 EVALUATION
Experimental setup: To quantify the performance improve-
ment, three popular CNN designs were tested: VGG-16 [10],
GoogleNet [14], and ResNet-50 [4]. The numbers of layers were
chosen to be 16, 22, and 50, respectively. As for the processor,
an Intel Knights Landing (Xeon Phi 7210) with 64 cores was
used. It has a peak single-precision arithmetic performance of
6 TFLOPS, and it is equipped with MCDRAM that can achieve
up to 400 GB/s. We used Caffe [6] as the CNN framework and
utilized Intel’s Math Kernel Library for Deep Neural Networks
(MKL-DNN v0.1 [5]) as it
is highly optimized for Intel Knights Landing. To measure the
bandwidth utilization, hardware profiling was used.
To test the proposed strategy, the 64 cores were divided
into 2, 4, 8, and 16 partitions and the memory utilization and
calculation speed were measured for each number of partitions. To
keep the number of images loaded in DRAM constant,
64/n images were assigned to a partition as a batch, where n is
the number of partitions. In this way, a total of 64 images are
processed by the entire processor at any time. Because of the
limited MCDRAM capacity (16GB), results up to 8 partitions
are provided for VGG-16, and up to 16 for GoogleNet and
ResNet-50. DRAM size can become the performance bottleneck
for a larger number of partitions, but we exclude such situations
because the DRAM size tradeoff is not in the scope of this study.
Note that VGG-16 exhausts DRAM capacity sooner because it needs a
larger space for loading all of its weights.
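The partition configurations described above can be enumerated with a few lines of code. This is a minimal sketch under the assumption that cores are split evenly among partitions; the function name `partition_plan` and the field names are hypothetical, and only the rule of 64/n images per partition comes from the setup above.

```python
def partition_plan(num_partitions, total_cores=64, total_images=64):
    """Split cores and in-flight images evenly across partitions (assumed even split)."""
    if total_cores % num_partitions or total_images % num_partitions:
        raise ValueError("cores and images must divide evenly among partitions")
    return {
        "partitions": num_partitions,
        "cores_per_partition": total_cores // num_partitions,
        "batch_per_partition": total_images // num_partitions,
        # Weights are not shared across partitions, so each layer's weights
        # are loaded from main memory once per partition.
        "weight_loads_per_layer": num_partitions,
    }


plans = [partition_plan(n) for n in (1, 2, 4, 8, 16)]
for plan in plans:
    print(plan)
```

Every configuration keeps 64 images in flight; what changes is how many cores share one copy of the weights and how many times those weights are fetched.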
Results: The baseline performance of synchronous data
reuse is investigated first. In Figure 4, the average and standard
deviation of memory bandwidth usage are shown for
ResNet-50 with no partition. The image batch size for data
reuse is the same as the number of cores such that each core
processes a single image per weight loading from DRAM.
It can be observed that the standard deviation increases as
the number of cores increases. This is expected because more
cores are equivalent to more concurrently processed images,
and therefore a larger fluctuation in the absolute size of total
bandwidth usage (in GB/s). As the standard deviation becomes
larger, there is a higher chance of the memory bandwidth
becoming the performance bottleneck. This leads to a decrease
in the average memory bandwidth usage per core as shown in
Figure 4, because more time is spent waiting in the
queue. Note that this memory bandwidth bottleneck problem is
expected to become even more critical when compute capability
is further improved with 16- and 8-bit operations.
[Figure 4 plot: Average Memory BW per Core (GB/s) and Standard Deviation
of Memory BW (GB/s) for ResNet-50 with No-P (4-core), No-P (8-core),
No-P (16-core), No-P (32-core), and No-P (64-core).]
Fig. 4: Average memory bandwidth per core and standard deviation
of memory bandwidth for increasing number of cores. Plotted
for ResNet-50.
[Figure 5 plot: Relative Performance, relative Average Memory BW, and
Relative Standard Deviation of Memory BW for VGG-16 (No-P, 2-Ps, 4-Ps,
8-Ps), GoogleNet (No-P, 2-Ps, 4-Ps, 8-Ps, 16-Ps), and ResNet-50 (No-P,
2-Ps, 4-Ps, 8-Ps, 16-Ps).]
Fig. 5: Relative performance, standard deviation of memory bandwidth,
and average of memory bandwidth for an increasing number of
partitions. Shown for VGG-16, GoogleNet, and ResNet-50.
[Figure 6 plot: Memory BW Utilization, axis from 0.0 to 1.0, over
Time (ms), axis from 0 to 240, for No-P, 4-Ps, and 16-Ps.]
Fig. 6: Memory bandwidth utilization over time for no-P, 4-Ps, and
16-Ps. Plotted for ResNet-50.
To address the memory bottleneck problem, we applied the
proposed partitioning strategy to the three CNN models, and
present the relative performance results in Figure 5. For
VGG-16, GoogleNet, and ResNet-50, the standard deviation is reduced
by up to 20.0%, 37.6%, and 36.2%, respectively. This confirms
that the fluctuation is reduced by increasing the number of
partitions. It
is a direct result of statistical traffic shaping over asynchronous
partitions, and thus the average bandwidth usage, i.e., memory
bandwidth utilization efficiency, is also improved by 18.7%,
22.7%, and 15.2%, respectively. As a result, the overall
performance is improved by 3.9%, 11.1%, and 8.0%, respectively.
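The relative metrics discussed above can be computed from sampled bandwidth traces as sketched below. This is an assumption-laden illustration: the two traces are made-up sample values rather than profiling data from this study, the function name `relative_metrics` is hypothetical, normalization to the unpartitioned run is assumed, and the population standard deviation is used because the exact estimator is not stated.

```python
from statistics import mean, pstdev


def relative_metrics(baseline_trace, partitioned_trace):
    """Mean and standard deviation of a trace, normalized to the baseline trace."""
    return {
        "relative_mean": mean(partitioned_trace) / mean(baseline_trace),
        "relative_sd": pstdev(partitioned_trace) / pstdev(baseline_trace),
    }


# Hypothetical bandwidth samples in GB/s (not measured data).
baseline = [380, 360, 120, 90, 60, 300, 70, 60]
partitioned = [230, 220, 200, 190, 170, 240, 180, 170]

metrics = relative_metrics(baseline, partitioned)
print(f"relative mean BW: {metrics['relative_mean']:.2f}")
print(f"relative SD of BW: {metrics['relative_sd']:.2f}")
```

A relative mean above 1 together with a relative standard deviation below 1 corresponds to the pattern reported for the partitioned runs: steadier traffic and better average bandwidth utilization.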
The partitioning improves performance of all three CNN
models. The performance improvement comes despite less
weight data reuse, because the bandwidth issue is more critical
for the tested cases. For the set of chosen test scenarios, the
performance steadily improves as the number of partitions
increases, except for VGG-16 with 8 partitions. In general, we
expect the performance to deteriorate as the number of partitions
becomes too large, but the limitation on DRAM size prevented
us from testing such scenarios. The performance improvement
is most significant when the number of partitions is increased
from 1 (no partition) to 2. This is because the reduction of fluctuation by
traffic shaping is most significant in that case. Figure 6 shows
the memory bandwidth utilization of no partition, 4 partitions,
and 16 partitions for ResNet-50. Without partitioning, memory
bandwidth utilization fluctuates severely. For 16 partitions,
however, the bandwidth utilization becomes relatively steady.
5 CONCLUSION
For CNN acceleration, a synchronous use of cores has been
considered a desirable solution because of its data reuse
efficiency. In this work, we have shown that such synchronous
use can have the downside of a memory bandwidth
bottleneck problem, especially for the latest CNN algorithms
whose weight data reuse is less critical. To provide a
mechanism for trading data reuse efficiency for memory bandwidth
utilization efficiency, we proposed a partitioning strategy where
compute units are divided into multiple partitions and different
partitions run asynchronously. The strategy was tested over
VGG-16, GoogleNet, and ResNet-50 using an Intel Knights
Landing processor with 64 cores. The evaluation results show that
the standard deviation of memory bandwidth usage is reduced
by 20.0-37.6% and the average is increased by 15.2-22.7%. This
indicates that statistical traffic shaping is achieved and the
memory bandwidth is better utilized on average. Overall,
CNN acceleration performance is improved by 3.9-11.1%.
ACKNOWLEDGMENTS
This work was partially supported by the National Research
Foundation of Korea grant funded by the Korea government
(NRF-2017R1A2B2005416 and NRF-2017R1E1A1A03070560).
REFERENCES
[1] A. M. Caulfield et al., “A Cloud-Scale Acceleration Architecture,”
in MICRO, 2016.
[2] D. Foley and J. Danskin, “Ultra-Performance Pascal GPU and
NVLink Interconnect,” Micro, IEEE, vol. 37, no. 2, Mar/Apr 2017.
[3] I. Goodfellow et al., Deep Learning. MIT Press, 2016.
[4] K. He et al., “Deep Residual Learning for Image Recognition,” in
CVPR, 2016.
[5] Intel, “Intel(R) Math Kernel Library for Deep Neural Networks,”
2016. [Online]. Available: https://github.com/01org/mkl-dnn
[6] Y. Jia et al., “Caffe: Convolutional Architecture for Fast Feature
Embedding,” arXiv:1408.5093, 2014.
[7] N. P. Jouppi et al., “In-datacenter Performance Analysis of a Tensor
Processing Unit,” in ISCA, 2017.
[8] NVIDIA, “NVIDIA Tesla V100 GPU Architecture,” 2017. [Online].
Available: http://www.nvidia.com/volta
[9] Y. Shen et al., “Escher: A CNN Accelerator with Flexible Buffering
to Minimize Off-Chip Transfer,” in FCCM, 2017.
[10] K. Simonyan and A. Zisserman, “Very Deep Convolutional Net-
works for Large-Scale Image Recognition,” arXiv:1409.1556, 2014.
[11] A. Sodani et al., “Knights Landing: Second Generation Intel Xeon
Phi Product,” Micro, IEEE, vol. 36, no. 2, Mar/Apr 2016.
[12] Y. H. Son et al., “Microbank: Architecting Through-Silicon
Interposer-Based Main Memory Systems,” in SC, 2014.
[13] C. Szegedy et al., “Inception-v4, Inception-ResNet and the Impact
of Residual Connections on Learning,” arXiv:1602.07261, 2016.
[14] C. Szegedy et al., “Going Deeper with Convolutions,” in CVPR,
2015.
[15] A. S. Tanenbaum et al., Computer Networks. Prentice Hall, 2010.
[16] X. Yang et al., “A Systematic Approach to Blocking Convolutional
Neural Networks,” arXiv:1606.04209, 2016.
