MERIT: Tensor Transform for Memory-Efficient Vision Processing on
  Parallel Architectures by Lin, Yu-Sheng et al.
Yu-Sheng Lin, National Taiwan Univ., Wei-Chao Chen, Skywatch Inc. and Inventec Inc.,
and Shao-Yi Chien, National Taiwan Univ.
Abstract—Computationally intensive deep neural networks
(DNNs) are well-suited to run on GPUs, but newly developed
algorithms usually require the heavily optimized DNN routines to
work efficiently, and this problem could be even more difficult for
specialized DNN architectures. In this paper, we propose a math-
ematical formulation which can be useful for transferring the
algorithm optimization knowledge across computing platforms.
We discover that data movement and storage inside parallel
processor architectures can be viewed as tensor transforms across
memory hierarchies, making it possible to describe many memory
optimization techniques mathematically. Such a transform, which
we call the Memory Efficient Ranged Inner-Product Tensor (MERIT)
transform, can be applied not only to DNN tasks but also to many
traditional machine learning and computer vision computations.
Moreover, the tensor transforms can be readily mapped to exist-
ing vector processor architectures. In this paper, we demonstrate
that many popular applications can be converted to a succinct
MERIT notation on GPUs, speeding up GPU kernels up to 20
times while using only half as many code tokens. We also use
the principle of the proposed transform to design a specialized
hardware unit called the MERIT-z processor. This processor can be
applied to a variety of DNN tasks as well as other computer vision
tasks while providing comparable area and power efficiency to
dedicated DNN ASICs.
Index Terms—Neural network hardware, vector processors,
parallel programming.
I. INTRODUCTION
Recently, deep learning technology has gained remarkable
success in many fields such as computer vision, image pro-
cessing, and natural language processing. The computations
of deep neural networks (DNNs) tend to be fairly regular, and
this means they are well-suited for highly-parallel processors
such as general purpose graphic processing units (GPGPUs).
While GPU programming languages like CUDA and OpenCL
have reached maturity and shown their promises in many
scientific fields, it still requires specialized knowledge in
GPU architectures in order to implement DNN algorithms
efficiently. In some cases, libraries are implemented with
insider knowledge that is not available to the public, such as
cuDNN from NVIDIA [1].
For the purpose of efficient DNN computation, a recent
trend is to create application-specific integrated circuits
(ASICs), as they can often achieve lower power consumption
or higher performance compared to general processors. Popular
designs include systolic arrays, due to their ability to directly
exchange data between processors [2], [3], and GPU-like vector
architectures with reduced arithmetic precision and larger
on-chip memory [4]. For extremely low power devices, ASICs can
also help reduce memory pressure by processing compressed
neural networks. These techniques include binarization [5],
[6], dictionary-based compression [7], and pruning-based
entropy compression [8]. While ASICs are more power efficient
than GPUs, they are not flexible enough to keep up with
the evolution of DNN techniques [9]–[18]. The new types
of computations can limit the widespread adoption of DNN
ASICs since the arithmetic units and memory organization
tend to be highly optimized for specific types of networks,
which makes it difficult or impractical to map these new
network layers to ASICs.
Y.S. Lin and S.Y. Chien are with the Graduate Institute of Electronics
Engineering, National Taiwan University.
Even though the processors come in different varieties, the
principle for optimizing data-intensive algorithms is always
the same: bring data closer to the processing elements. For
this purpose, DNN processors often do not employ traditional
memory hierarchies. Instead, programmers are given the free-
dom to explicitly control data movement between different
types of memories, such as local, on-chip, or off-chip memory
buffers. Unfortunately, this means each algorithm needs to be
optimized for each new processor by redesigning a new data
movement procedure, causing repeated and wasted efforts.
To solve this problem, we propose the Memory Efficient
Ranged Inner-Product Tensor (MERIT) transform which can
incorporate data movement patterns into algorithms, as shown
in Fig. 1a. A MERIT transform converts one matrix (or
generally, one tensor) into another by performing permutation
and repetition of the matrix elements. This allows us to convert
computational tasks into vector-wise products or reductions,
as shown in Fig. 1b. An efficient MERIT transform therefore
follows the steps of the data's journey through the memory
hierarchy, and each step performs a partial transform representing
a data movement procedure, such as tiling, shuffling, or bank
conflict avoidance, as shown in Fig. 1c. These partial transforms
correspond to common resource constraints of GPUs or
DNN architectures, such as limited on-chip SRAM or partially
connected data networks. Therefore, when a general vision-
related computation can be reduced to a MERIT transform,
many memory optimization techniques can be implemented
on a particular architecture with very little effort.
To demonstrate the usefulness of MERIT, we use it to opti-
mize several algorithms on GPUs and ASICs, including bilat-
eral filter [19], motion estimation, and many DNN layers [13],
[16]–[18], [20]–[22]. For GPUs, we implement MERIT for
CUDA as a header file, which can be easily integrated into
existing algorithms to provide higher performance than many
hand-tuned implementations. The simplicity of the MERIT
representation, and its strong connection between mathematics
and architecture, means that we can also use it to create
hardware with relatively little cost and effort. We adopt this
approach to design an open-source ASIC processor, the MERIT-z,
that uses classic, efficient circuits such as butterfly networks
to achieve efficient DNN and computer vision performance.
Specifically, our contributions are as follows:
• MERIT transform – a data transform methodology that
can succinctly describe data movement and reduce
optimization efforts in parallel processing,
• MERIT for CUDA – an open-source API which produces
fast GPU kernels with fewer code tokens compared to
naïve CPU implementations, and
• MERIT-z processor – an open-source general vector
processor designed with insights gained from the MERIT
transform, supporting both common DNN layers and
traditional vision processing.
[Figure 1. (a) Traditional flow versus ours: general vision tasks (convolution, matrix multiplication, motion estimation, ...) are converted by the MERIT transform, which captures the architecture-specific data movement for algorithm optimization, into a uniform computation. (b) Results = dot(M(A), M(B)) followed by a reshape, run on SIMD (GPU or ASIC); benefits: little coding effort, small program memory, easy mapping to parallel architectures. (c) M(A) = μn...μ2μ1(A); partial transforms are mapped to hardware: tiling (μ1), conflict-free access (μ2), and shuffling (μ3), moving data from DRAM through SRAM banks to a buffer.]
Fig. 1. An overview of the MERIT transform. (a) Reduce optimization efforts by abstracting the data movement stage into a tensor transform. (b) Convert
computing tasks into a SIMD-friendly arrangement. (c) MERIT tensor can be mapped to hardware memory hierarchy for efficient execution.
arXiv:1911.03458v1 [cs.DC] 7 Nov 2019
Section II provides the background
knowledge for readers. In Section III, we formally define
the MERIT transform and illustrate how to express different
vision workloads with the transform. Based on this transform
definition, in Sections IV and V, we show how to map parallel
workloads to an efficient CUDA execution and onto the
MERIT-z processor. Finally, we show the experimental results
and sum up the paper.
II. RELATED WORKS
Kernel transform for parallel program optimization.
For parallel architectures, the performance of a computation
task is usually limited by the memory bandwidth to the
computing units. While there are many common principles
for memory optimization, it has never been easy to optimize
an algorithm on a particular parallel architecture, and there
has been extensive research to tackle this problem.
One approach is to design a new domain-specific programming
language (DSL) that can describe applications in
specific domains, such as Halide [23] and TVM [24]. An
algorithm written in DSLs usually requires fewer lines of code
than those in generic languages like C++ or OpenCL. DSLs
can also be compiled to target specific types of hardware,
and the process often involves searching for parameters for
different memory configurations, as in [25]. This means the
design of DSL compilers and their searching strategies requires
significant effort, and their language features are less stable
because they can evolve more rapidly compared to generic
languages. Some DSLs also involve advanced programming
concepts, making it difficult to appeal to a wider audience. For
example, Halide defines images as N-dimensional functions and
describes the image pipelines as functional compositions. In
TVM, the computation schedule is stored in symbolic variables
that programmers can use to build algorithms.
Another simpler approach is to convert a new computation
problem into an existing one. For example, a common implementation
for the CONV layer in deep learning on GPUs
is to unroll, or flatten, patches of an image to columns of
a matrix. This transforms the CONV layer into a GEneral
Matrix-Matrix multiplication (GEMM), a routine that has been
heavily optimized in existing libraries such as cuBLAS. This
technique has also been applied to ASIC design in order to
support CONV on GEMM architectures [2], [26]. Conversion-
based methods are easier to implement since programmers
only need to write the conversion routine to take advantage
of the well-optimized routines. However, not every problem
can be easily converted, and the process may sometimes
introduce significant overhead, which is discussed further
in Section III.
The third approach is to divide algorithms into combinations
of interchangeable function blocks. Programmers implement
algorithms by connecting different blocks without having to
understand how to actually schedule the computation. For
example, in a GPU, programmers write the vertex and frag-
ment shaders to describe object transform and color shading
rules, and the 3D objects are rendered to the 2D screen
automatically by the processor. MapReduce [27] decomposes
large-scale text-processing tasks into data sorting and two
custom functions, map and reduce. Convolution Engine [28]
is a MapReduce-based streaming image processor with
restricted map and reduce operators. UMI Operator [29] shows
that several vision-related computations can be described by a
custom unrolling procedure and a generalized inner-product
function.
Later in this paper, we show that, for many vision-related
kernels or algorithms, we can decompose their data
movement processes such that algorithms can run on simple
SIMD architectures. This transform is related to the unrolling
above and can be defined with a few parameters for the data
network hardware to execute efficiently. We also propose a
methodology for mapping this transform to GPUs and ASICs,
which can greatly reduce the optimization effort for parallel
programs.
Hardware for image, vision, and DNN. Here we
review several DNN accelerators, machine learning accelerators, and vision
processors published in mainstream conferences for graphics,
computer architecture, and circuit technology. For a more
comprehensive review of DNN accelerators, we refer
readers to survey papers such as [30], [31].
Dedicated DNN ASICs are usually designed for speeding
up either FC or CONV layers, but they can also support other
types of layers via conversion. For example, DianNao [26]
uses a (16 × 16)-by-(16) matrix-vector processor to compute
both FC and CONV layers, and it needs to unroll the feature
maps in the input buffers in order to evaluate the CONV
layers. TPU [2] uses systolic arrays and is able to perform
a variable (256 × 256)-by-(256 × N) matrix-matrix operation,
and it supports CONV layers by storing the unrolled feature
maps within 256 separate queues. Eyeriss [3] holds a row
in the processor and adds a path to broadcast rows for the
CONV layers, but this path is not utilized by the FC layers.
In a more recent work [32], separate hardware units are used
to process FC and CONV layers independently. MAERI [33]
adopts a tree-based distribution and reduction path to support
more types of DNN layers, but it can be difficult to configure
these trees.
[Figure 2. Original computation: MatMul of a (2,3) matrix A and a (3,3) matrix B gives a (2,3) result. MERIT transformed domain: M(A) and M(B) each have six rows indexed (0,0) to (1,2), the parallelism axis, and three columns indexed (0) to (2). M(A) repeats each row of A three times; M(B) lists the three columns of B and repeats that set twice.]
Fig. 2. A matrix multiplication operation in the parallel form using MERIT.
There exist ASIC designs that aim to support more general
parallel computation tasks. Convolution Engine [28] and
PuDianNao [4] use a fixed-function pipeline architecture
and support different types of computations by enabling different
functional units. Reconfigurable interconnection is yet
another example, where FPGAs are dynamically programmed
to execute various computing tasks [34], [35]. In the CRISP [36]
streaming processor, different functional units are connected
by a programmable crossbar, and various image processing
methods can be applied as image pixels are streamed into
the processor.
Register File Cache prevents the round trip between pro-
cessors and the SRAM, improving the overall throughput and
reducing the power consumption [37]. Affine Warp detects the
regular register values across parallel threads, minimizing the
redundancy of computing and storing these registers [38]. The
design of our MERIT-z processor is related to this line of GPGPU
optimization research, except that our architecture is derived
through insights from the mathematical framework presented
in this paper.
III. THE MERIT TRANSFORM
The MERIT transform, denoted as M(·), is a mathematical
process for converting a tensor into another, such that a given
algorithm can be transformed into the SIMD domain for easier
operation.
As an example, we can apply the MERIT transform to the
GEMM algorithm C = AB and obtain
Vec(C) = R(M(A),M(B),), (1)
where R(X,Y,) means applying dot-product  to every
row of (X,Y), Vec(C) is the vectorized form of matrix C,
and the MERIT transforms M(A), M(B) are created by
repeating rows or columns of the input matrices (Fig. 2).
While this formulation appears to complicate matters, there are
several computational advantages to this expression. First, each
dot-product can be computed independently, which means
this equation can be evaluated in parallel. Second, M(·) and
 can both be specified by the users, making this equation
configurable for general computational tasks. Third, elements
in the transformed matrices are copies of the original ones,
which means M(·) is a pure data movement operation, leaving
 as the only real arithmetic operations in the equation.
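The GEMM formulation in (1) can be sketched in a few lines of C++. This is only an illustrative sketch: the helper names `merit_a`, `merit_b`, and `rowwise_dot` are hypothetical and not part of the MERIT API, and the transformed matrices are materialized explicitly here for clarity, whereas the point of the transform is that the repetition can happen during data movement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<float> Row;
typedef std::vector<Row> Matrix;

// M(A): row (i, j) of the result holds row i of A, repeated for each of the n output columns.
Matrix merit_a(const Matrix& A, std::size_t n) {
  Matrix out;
  for (std::size_t i = 0; i < A.size(); ++i)
    for (std::size_t j = 0; j < n; ++j)
      out.push_back(A[i]);
  return out;
}

// M(B): row (i, j) of the result holds column j of B, repeated for each of the m output rows.
Matrix merit_b(const Matrix& B, std::size_t m) {
  assert(!B.empty());
  const std::size_t n = B[0].size();
  Matrix out;
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      Row col(B.size());
      for (std::size_t k = 0; k < B.size(); ++k) col[k] = B[k][j];
      out.push_back(col);
    }
  return out;
}

// R(X, Y, dot): apply the dot product to every pair of rows; the result is Vec(C).
std::vector<float> rowwise_dot(const Matrix& X, const Matrix& Y) {
  assert(X.size() == Y.size());
  std::vector<float> c(X.size(), 0.0f);
  for (std::size_t r = 0; r < X.size(); ++r) {
    assert(X[r].size() == Y[r].size());
    for (std::size_t k = 0; k < X[r].size(); ++k) c[r] += X[r][k] * Y[r][k];
  }
  return c;
}
```

Calling `rowwise_dot(merit_a(A, n), merit_b(B, m))` for an (m, k) matrix A and a (k, n) matrix B yields the row-major vectorization of AB; every output element comes from an independent row, which is the parallelism the text refers to.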
Because the MERIT transform M(·) implies data movement,
it is important to make sure M(·) maps to actual
hardware to achieve optimal performance. For this purpose, we
factorize the MERIT transform as M(A) = µn ◦ · · · ◦ µ2 ◦ µ1(A),
where each partial transform, or sub-step µi, is a repetition
and permutation on the input data (Fig. 1c). For a particular
architecture, we would need a particular set of sub-steps to
reflect its memory and data-path design such as multi-bank
SRAM or DRAM. Fortunately, these sub-steps can often be
reused for the same architecture across different algorithms,
and we shall discuss the subject of efficient MERIT transforms
in Section IV. Before then, let us start with a few more
examples to familiarize the readers with the benefits of this
transform.
[Figure 3. (a) Convolution via unrolling: convolving a (3,4) image A with a (2,2) kernel B gives a (2,3) result; the equivalent GEMV multiplies the unrolled A (six rows, one 2 × 2 patch per row) by the vectorized B and produces a vector of length 6 that can be reshaped to (2,3). (b) Convolution in MERIT representation: M(A) has rows indexed (0,0) to (1,2) and columns indexed (0,0) to (1,1), holding the same patches; M(B) repeats the vectorized kernel in every row.]
Fig. 3. A convolution operation in parallel form. (a) Convolution via unrolling. (b) Convolution in MERIT representation.
A. Convolution on SIMD Processors
Fig. 3a shows a popular technique for accelerating convo-
lution on SIMD processors [39]. This process converts con-
volution into GEneral Matrix-Vector multiplications (GEMVs)
by unrolling the input image A and vectorizing the kernel B,
which can be written as
Vec(C) = U(A)×Vec(B). (2)
This is very effective because it takes advantage of the well-
optimized GEMV routines on SIMD processors. However,
unrolling creates an unnecessarily large input matrix U(A),
which is clearly sub-optimal.
A more efficient process would be to replicate data on-the-fly.
This means we could assign M(A) ≡ U(A), and expand
the data during the transfer process, as shown in Fig. 1c. Rather
than expanding the image before the input stage, M(·) expands
during data movement, such that replication of data can occur
as late as possible to reduce memory bandwidth and buffer
size. Note that in order to map the above equation to (1),
M(·) also needs to be applied to the kernel B (Fig. 3b), but
this is efficient and also done on-the-fly.
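The difference between unrolling and on-the-fly expansion can be illustrated with a minimal C++ sketch. The function names are hypothetical, and the code assumes a valid (unpadded), stride-1 convolution as in Fig. 3: each element of M(A) is read directly from A at index (p1 + a1, p2 + a2) instead of being stored in an explicit U(A).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<float> > Image;

// Convolution without materializing U(A): element (p, a) of M(A) is fetched
// from A[p1 + a1][p2 + a2] at the moment it is consumed.
std::vector<float> conv_on_the_fly(const Image& A, const Image& B) {
  assert(!A.empty() && !B.empty());
  assert(A.size() >= B.size() && A[0].size() >= B[0].size());
  const std::size_t oh = A.size() - B.size() + 1;
  const std::size_t ow = A[0].size() - B[0].size() + 1;
  std::vector<float> c(oh * ow, 0.0f);
  for (std::size_t p1 = 0; p1 < oh; ++p1)
    for (std::size_t p2 = 0; p2 < ow; ++p2)
      for (std::size_t a1 = 0; a1 < B.size(); ++a1)
        for (std::size_t a2 = 0; a2 < B[0].size(); ++a2)
          c[p1 * ow + p2] += A[p1 + a1][p2 + a2] * B[a1][a2];
  return c;
}

// Number of elements an explicit unrolled matrix U(A) would occupy, for comparison.
std::size_t unrolled_elements(const Image& A, const Image& B) {
  assert(!A.empty() && !B.empty());
  const std::size_t oh = A.size() - B.size() + 1;
  const std::size_t ow = A[0].size() - B[0].size() + 1;
  return oh * ow * B.size() * B[0].size();
}
```

For the (3,4) image and (2,2) kernel of Fig. 3, an explicit U(A) would hold 24 elements although A itself has only 12; the on-the-fly version never stores the duplicates.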
B. Extending to General Tensor Product
A MERIT transform is coupled with a product function
⊙ such that (1) represents an algorithm and can run on
SIMD efficiently. We can change this function to adapt the
equation to different algorithms. For the MERIT GEMM and
convolution, we apply the dot-product
$$v_1 \odot v_2 \equiv v_1^T v_2 \tag{3}$$
to each vector pair of the transformed matrices. Similarly, if
we define
$$v_1 \odot v_2 \equiv \begin{cases} \max(v_1^T v_2, 0) \\ \|v_1 - v_2\|_1 \end{cases}, \tag{4}$$
then we can apply our transform to more computations, such as
a fused ReLU layer or patch distance algorithms like motion
estimation. If we couple such vector products to the MERIT
transform, then we in fact apply vector functions to flattened
tensors from 2D image patches or 3D convolution kernels.
Instead of defining the function ⊙ on vectorized tensors
Vec(T1) ⊙ Vec(T2), a more natural way is to define a tensor
product that computes T1 ⊙ T2 directly. Our proposal for a
generalized tensor product, called Ranged Inner-Product, is
related to the ideas from [28], [29] where a vector product
is expressed as a strategy class, one of the most common
design patterns. An implementation of the strategy-based
vector product is listed below in Listing 1.
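Listing 1. ⊙ for the CONV+ReLU layer, as a strategy class.
#include <algorithm>
#include <cstddef>
#include <vector>
// Strategy for a fused CONV+ReLU product: accumulate, then clip at zero.
class ReLUStrategy {
public:
  void PreLoop() { act = 0.0f; }                     // initialize the partial sum
  void Loop(float a, float b) { act += a * b; }      // accumulate one product
  float PostLoop() { return std::max(act, 0.0f); }   // apply the ReLU clipping
private:
  float act;
};
typedef std::vector<float> Vector;
// Generic inner product; the strategy supplies the three loop hooks.
template<class Strategy>
float inner_prod(const Vector& a, const Vector& b) {
  Strategy s;
  s.PreLoop();
  for (std::size_t i = 0; i < a.size(); ++i)
    s.Loop(a[i], b[i]);
  return s.PostLoop();
}
// This line actually calls the odot function
// inner_prod<ReLUStrategy>(...);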
Here, programmers can write the ReLU product as a class
consisting of three strategy functions and a for-loop. These
strategy functions perform partial sum initialization, accumulation,
and clipping, respectively. Such a strategy class, when
combined with the MERIT transform, provides high flexibility for
expressing various workloads. For example, if we load another
tensor and add it to act in PostLoop, we then obtain an
inner-product function for a residual layer.
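As a further illustration of this flexibility, the L1 patch distance in (4) can be written with the same three hooks. The class name `L1Strategy` is hypothetical, and the `inner_prod` template is repeated here (with an added size check) only so that the sketch is self-contained.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical strategy for the L1 distance ||v1 - v2||_1 used in motion estimation.
class L1Strategy {
 public:
  void PreLoop() { dist = 0.0f; }                             // reset the accumulated distance
  void Loop(float a, float b) { dist += std::fabs(a - b); }   // add one absolute difference
  float PostLoop() { return dist; }                           // no post-processing is needed
 private:
  float dist;
};

// Same loop skeleton as in Listing 1; only the strategy changes.
template <class Strategy>
float inner_prod(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  Strategy s;
  s.PreLoop();
  for (std::size_t i = 0; i < a.size(); ++i)
    s.Loop(a[i], b[i]);
  return s.PostLoop();
}
```

Swapping `ReLUStrategy` for `L1Strategy` changes the arithmetic without touching the loop or the data movement, which is the separation the strategy pattern is meant to provide.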
Extending the idea of strategy beyond vectors, programmers
can use a more complex class and nested for-loops to implement
a 2D Ranged Inner-Product, which is a 2 × 2 for-loop,
as shown in Fig. 4a. For 3D or higher dimensional tensors,
however, this would not be suitable because of the resulting
deeply nested for-loops, and therefore we propose to linearize
the nested loops. Fig. 4b shows the function sequences being
executed for the 2D case, where each index corresponds
to a certain execution order of the functions. Fig. 4c and
4d show the efficient Ranged Inner-Product implementation.
First, the strategy functions are concatenated to form a single
program, and their starting and ending addresses are stored
as tables. Given the loop indices (i.e., i, j), the starting and
ending addresses are selected from the table, which decides
the function range that should be executed for those indices.
To support higher-dimension execution, we can simply enlarge
the address tables and implement the selection using parallel
prefix-sum.
[Figure 4. (a) Strategy-based inner-product: PreCol(); for i in 0:1 { PreRow(); for j in 0:1 { Loop() }; PostRow() }; PostCol(). (b) The actual execution of (a) for indices (0,0), (0,1), (1,0), (1,1): PreCol, PreRow, Loop, Loop, PostRow, PreRow, Loop, Loop, PostRow, PostCol. (c) Select executed range by indices in the concatenated program (PreCol, PreRow, Loop, PostRow, PostCol), with starting addresses 0, 5, 9 and ending addresses 16, 22, 25. (d) The selection process of (c): the current loop index (i, j) is compared with the first loop index (0, 0) and the last loop index (1, 1), and a prefix-sum is used to find the first false.]
Fig. 4. Ranged Inner-Product. Execute a 2D tensor product by flattening
ranges of a strategy class. It extends the strategy class of Listing 1 to the 2D
case, so there are five strategy functions in (a) instead of three.
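The range selection described above can be modeled in software as follows. This is a sketch under stated assumptions, not the hardware implementation: the function name `run_linearized` is hypothetical, the program is addressed at function granularity rather than by instruction addresses, the names Pre0, Pre1, ... stand for PreCol, PreRow, ... in the 2D case, and a sequential count of innermost "first" and "last" flags stands in for the parallel prefix-sum.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the sequence of strategy functions executed for a loop nest with the
// given extents (outermost dimension first), using a single flat loop.
std::vector<std::string> run_linearized(const std::vector<std::size_t>& extent) {
  const std::size_t D = extent.size();
  assert(D > 0);
  // Concatenated program: Pre0 ... Pre(D-1), Loop, Post(D-1) ... Post0.
  std::vector<std::string> program;
  for (std::size_t d = 0; d < D; ++d) program.push_back("Pre" + std::to_string(d));
  program.push_back("Loop");
  for (std::size_t d = D; d-- > 0;) program.push_back("Post" + std::to_string(d));

  std::size_t total = 1;
  for (std::size_t d = 0; d < D; ++d) {
    assert(extent[d] > 0);
    total *= extent[d];
  }

  std::vector<std::string> trace;
  std::vector<std::size_t> idx(D, 0);
  for (std::size_t n = 0; n < total; ++n) {
    // Count how many innermost dimensions sit at their first / last index.
    std::size_t n_first = 0;
    while (n_first < D && idx[D - 1 - n_first] == 0) ++n_first;
    std::size_t n_last = 0;
    while (n_last < D && idx[D - 1 - n_last] + 1 == extent[D - 1 - n_last]) ++n_last;
    // These two values play the role of the starting/ending address tables.
    const std::size_t start = D - n_first;
    const std::size_t end = D + 1 + n_last;  // exclusive
    for (std::size_t pc = start; pc < end; ++pc) trace.push_back(program[pc]);
    // Advance the loop index like an odometer, innermost dimension fastest.
    for (std::size_t d = D; d-- > 0;) {
      if (++idx[d] < extent[d]) break;
      idx[d] = 0;
    }
  }
  return trace;
}
```

For extents (2, 2) the trace is Pre0, Pre1, Loop, Loop, Post1, Pre1, Loop, Loop, Post1, Post0, matching the order of a nested 2 × 2 loop while using a single flat loop whose depth does not grow with the tensor rank.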
C. Putting It Together
In summary, the MERIT transform alters an input tensor
into another tensor, for the purpose of converting parallel al-
gorithms into simpler parallel tensor reductions, or the Ranged
Inner-Products. This means that in the previous examples
(Fig. 2 and Fig. 3b), the matrices M(A) and M(B) are in
fact tensors flattened into 2D matrices. This flattening process
is based on splitting the tensor indices into two parts. The row
indices p reflect the algorithm parallelism, and column indices
a reflect the number of accumulated elements in a tensor
product. In these figures, the rows and columns of M(A)
and M(B) are assigned with grid indices like (0, 0), (0, 1),
(0, 2), (1, 0) · · · . We then use the notation $M(A)_{p,a}$ to refer
to one element in a transformed tensor, and for convenience,
we denote the concatenated indices as k = (p, a). Changing
$k_j$, the j-th component of k, which belongs to either p or a,
is equivalent to moving along a particular axis $d_j$ of A with
a certain stride $s_j$ and offset $o_j$,
$$M(A)_{p,a} = A_x, \quad \text{where } x_i = \sum_j \delta_{i,d_j}\,(k_j s_j + o_j), \tag{5}$$
where δ is the Kronecker delta. The equation above first calculates
the indices x from (p, a) and the parameters $d_j$, $s_j$, $o_j$;
then x is used to locate a specific element $A_x$ in A.
[Figure 5. (a) Tiled convolution and (b) tiled matrix multiplication: the same layouts as Fig. 3b and Fig. 2, with one sub-tensor of M(A) and M(B) highlighted together with the sub-block of the original image, kernel, or matrices that contains it.]
Fig. 5. MERIT transform with tiling. (a) Tiled convolution. (b) Tiled matrix multiplication. Practical limits of computational
resources dictate processing of subsets of the workload, which is equivalent
to dividing the MERIT tensor into sub-tensors.
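The index calculation in (5) can be sketched as a small C++ function. The struct name `AxisMap` and the function name `merit_index` are hypothetical; each entry carries the axis $d_j$, stride $s_j$, and offset $o_j$ of one component of k.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Parameters of one component k_j of the concatenated index k = (p, a).
struct AxisMap {
  std::size_t axis;  // d_j: the axis of A that k_j moves along
  long stride;       // s_j
  long offset;       // o_j
};

// Evaluates x_i = sum_j delta(i, d_j) * (k_j * s_j + o_j) for a tensor A of the given rank.
std::vector<long> merit_index(const std::vector<long>& k,
                              const std::vector<AxisMap>& map,
                              std::size_t rank) {
  assert(k.size() == map.size());
  std::vector<long> x(rank, 0);
  for (std::size_t j = 0; j < k.size(); ++j) {
    assert(map[j].axis < rank);
    x[map[j].axis] += k[j] * map[j].stride + map[j].offset;
  }
  return x;
}
```

For the convolution of Fig. 3b, with k = (p1, p2, a1, a2), the map {(0,1,0), (1,1,0), (0,1,0), (1,1,0)} gives x = (p1 + a1, p2 + a2). A component that does not index A at all can be given a stride of zero.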
Based on the discussions above, a MERIT transformed ten-
sor should reflect the total complexity and available parallelism
of a computation. For example, an (m, k)-by-(k, n) GEMM
problem has a total complexity of Θ(mnk) with Θ(mn)
available parallelism, and the sizes of the transformed tensors
are ((m,n), (k)). Similarly, a (k × k)-convolution problem on
an (h × w) image has a total complexity of Θ(hwk²) with
Θ(hw) available parallelism, and the sizes of the transformed
tensors are ((h,w), (k, k)).
We show an example for the MERIT transform of
AlexNet CONV1. This layer adopts a stride size of 4 and an
11 × 11 kernel size, so it can be written as
$$(p_1, p_2, p_3, a_1, a_2, a_3) \in \mathrm{NDRange}(48, 55, 55, 3, 11, 11)\quad \begin{cases} M(I)_{p_1,p_2,p_3,a_1,a_2,a_3} = I_{a_1,\,4p_2+a_2-5,\,4p_3+a_3-5} \\ M(k)_{p_1,p_2,p_3,a_1,a_2,a_3} = k_{p_1,a_1,a_2,a_3} \end{cases}, \tag{6}$$
where I and k are the input feature map and the convolution
kernel, respectively. Note that here we implicitly define the $x_i$ of (5) in
the subscripts of the right-hand side expressions.
Apart from strided CNNs, we can express several popular
CNN variants with the MERIT transform definition. For example,
the dilated CNN [13], which can effectively increase the
receptive field, is defined by
$$M(I)_{p_1,p_2,p_3,a_1,a_2,a_3} = I_{a_1,\,p_2+2a_2,\,p_3+2a_3}. \tag{7}$$
The correlation layer [16] is a CNN layer for optical flow. It
simulates the traditional motion estimation computation and
can be expressed with
$$\begin{cases} M(I_1)_{p_1,p_2,p_3,p_4,a_1} = (I_1)_{a_1,\,p_1,\,p_2} \\ M(I_2)_{p_1,p_2,p_3,p_4,a_1} = (I_2)_{a_1,\,p_1+p_3,\,p_2+p_4} \end{cases}. \tag{8}$$
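To make (8) concrete, the following C++ sketch evaluates a correlation layer by reading both operands through the index expressions of (8) and reducing over the channel index a1 with a dot-product. The function name and the tensor layout are assumptions made for illustration: the search range is D × D with non-negative displacements (p3, p4), and I2 is assumed to be large enough that p1 + p3 and p2 + p4 stay in bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<std::vector<std::vector<float> > > Tensor3;  // [channel][row][col]

// out[p1][p2][p3][p4] = sum over a1 of I1[a1][p1][p2] * I2[a1][p1 + p3][p2 + p4],
// returned flattened in row-major order over (p1, p2, p3, p4).
std::vector<float> correlation(const Tensor3& I1, const Tensor3& I2, std::size_t D) {
  assert(!I1.empty() && I1.size() == I2.size() && D > 0);
  assert(!I1[0].empty() && !I1[0][0].empty());
  const std::size_t C = I1.size(), H = I1[0].size(), W = I1[0][0].size();
  assert(I2[0].size() >= H + D - 1 && I2[0][0].size() >= W + D - 1);
  std::vector<float> out(H * W * D * D, 0.0f);
  for (std::size_t p1 = 0; p1 < H; ++p1)
    for (std::size_t p2 = 0; p2 < W; ++p2)
      for (std::size_t p3 = 0; p3 < D; ++p3)
        for (std::size_t p4 = 0; p4 < D; ++p4) {
          float acc = 0.0f;
          for (std::size_t a1 = 0; a1 < C; ++a1)  // accumulation index a
            acc += I1[a1][p1][p2] * I2[a1][p1 + p3][p2 + p4];
          out[((p1 * W + p2) * D + p3) * D + p4] = acc;
        }
  return out;
}
```

The four parallel indices (p1, p2, p3, p4) only select where to read, so the whole layer reduces to independent dot-products of length C.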
IV. EFFICIENT MERIT TRANSFORM ON GPUS
Modern GPUs have SIMD-like designs that are suitable
for workloads targeted by the MERIT transform. However,
because of the intricate design of GPU memory hierarchy, care
must be taken while fetching data, as the latency introduced
by cache misses tends to be fairly high. Traditionally, in order
to maximize data locality, programmers are responsible for
dividing workloads into independent sub-tasks and planning the
data fetches accordingly. These very tedious tiling or blocking
optimization processes can be replaced by factorized MERIT
transforms, as shall be explained in the remainder of this
section.
A. Tiled Parallel Execution
With the MERIT transform, supporting tiled execution can
be simply achieved by dividing M(A) and M(B) into sub-
tensors of sizes (tp, ta). Fig. 5 highlights some sub-tensors
from Fig. 3b and Fig. 2 with (tp, ta) = ((2, 2), (1, 2))
and ((2, 2), (2)), respectively. It can be observed that for an
arbitrary tensor A and its transformed tensor M(A), each
sub-tensor of M(A) can be fully contained by a minimal sub-
tensor of A, whose size at the i-th dimension is
$$1 + \sum_j (t_j - 1)\, s_j\, \delta_{d_j,i}, \tag{9}$$
where $d_j$ and $s_j$ represent the same axes and strides as in (5),
and $t_j$ denotes the size at the j-th dimension of the sub-tensor
in the M(A) domain. This allows us to compute the memory
footprint of the sub-tensor. For example, using a 5-by-5 kernel
in convolution in order to obtain a 16-by-8 pixel output, we
assign (tp, ta) = ((16, 8), (5, 5)) in (9), which gives the
memory size requirement of (1 + 1(16 − 1) + 1(5 − 1), 1 + 1(8 − 1)
+ 1(5 − 1)) = (20, 12). Instead of loading all elements of
M(A) directly from A, we first load its sub-tensor into
user-addressable shared memory, and programmatically transform
the tensor to the M(A) domain on-the-fly, thereby eliminating
all memory bandwidth overhead caused by duplication, since
all data are loaded from the shared memory.
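The footprint formula (9) is easy to check numerically. The sketch below uses hypothetical names (`AxisStride`, `tile_footprint`) and reproduces the 5-by-5 kernel, 16-by-8 output example from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Axis d_j and stride s_j of one component of the tile index, as in (5).
struct AxisStride {
  std::size_t axis;
  std::size_t stride;
};

// Size of the minimal sub-tensor of A that contains a tile of M(A) with sizes t:
// size_i = 1 + sum_j (t_j - 1) * s_j * delta(d_j, i).
std::vector<std::size_t> tile_footprint(const std::vector<std::size_t>& t,
                                        const std::vector<AxisStride>& map,
                                        std::size_t rank) {
  assert(t.size() == map.size());
  std::vector<std::size_t> size(rank, 1);
  for (std::size_t j = 0; j < t.size(); ++j) {
    assert(t[j] >= 1 && map[j].axis < rank);
    size[map[j].axis] += (t[j] - 1) * map[j].stride;
  }
  return size;
}
```

With t = (16, 8, 5, 5) and unit strides on axes (0, 1, 0, 1), the function returns (20, 12), so a 20 × 12 block in shared memory serves all 16 × 8 × 5 × 5 = 3200 elements of the M(A) tile.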
B. Bank Conflict Avoidance
In GPUs, the shared memory may consist of many SRAM
banks, and the number of banks is usually designed to match
the number of processor cores. When the SIMD lanes request
elements from the same SRAM bank at the same time, a bank
conflict occurs and the system performance drops drastically.
To avoid this problem, note that the vertical direction
of the MERIT transformed matrices reflects the parallelism.
This property reduces the bank-conflict problem to the analysis
of column vector addresses within the sub-tensors of M(A).
Fig. 6 illustrates examples for resolving bank-conflict by
padding [40, chap. 39], XOR-hash [41], [42], and our proposed
re-tiling technique. The example shows a vector processor
with 8 ALUs computing a 3 × 3 convolution for a 4 × 4
output block, requiring a 6 × 6 block in total (Fig. 6(i)).
The right parts of the figure illustrate some arrangements
for distributing the 6 × 6 pixel block into 8 SRAM banks
physically. In (ii-a), conflicts are highlighted in red, whereas
bank conflicts in (ii-b) are resolved by padding 6 elements
after every 6 elements. To reduce the padding waste, (ii-c)
uses only 2 padding elements with XOR-hash, achieved by swapping
the first and last four elements in every other row. (iii) and
(iv) show two different re-tilings that can create conflict-free
scenarios without any hashing or padding requirement. Later
in Section V, we shall provide more details on finding suitable
configurations and on storing and transferring SRAM data.
We will also discuss our open-source implementation of the
MERIT transform on GPUs in more detail in Section VI.
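The padding and XOR-hash layouts can be checked with a small simulation. The sketch below rests on an assumption that is not stated in the text: the 8 threads are grouped as 2 rows by 4 columns of the 4 × 4 output block, and all of them read the same kernel offset in a given cycle. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kBanks = 8;

// Maps pixel (row, col) of the cached 6x6 block to an SRAM bank.
// pitch: number of elements stored per row (6 pixels plus padding).
// xor_hash: swap the first and last four columns on odd rows (pitch must be 8).
std::size_t bank_of(std::size_t row, std::size_t col, std::size_t pitch, bool xor_hash) {
  assert(!xor_hash || pitch == kBanks);
  if (xor_hash) col ^= (row & 1) << 2;
  return (row * pitch + col) % kBanks;
}

// Worst-case number of simultaneous accesses to a single bank (1 means conflict-free),
// over both 2x4 thread groups of the 4x4 output block and all 3x3 kernel offsets.
std::size_t worst_conflict(std::size_t pitch, bool xor_hash) {
  std::size_t worst = 0;
  for (std::size_t g = 0; g < 2; ++g)
    for (std::size_t dy = 0; dy < 3; ++dy)
      for (std::size_t dx = 0; dx < 3; ++dx) {
        std::vector<std::size_t> hits(kBanks, 0);
        for (std::size_t r = 0; r < 2; ++r)
          for (std::size_t c = 0; c < 4; ++c)
            ++hits[bank_of(2 * g + r + dy, c + dx, pitch, xor_hash)];
        worst = std::max(worst, *std::max_element(hits.begin(), hits.end()));
      }
  return worst;
}
```

Under these assumptions, the naive pitch of 6 gives a worst case of two accesses per bank, a pitch of 12 (6 padding elements per row) is conflict-free, and a pitch of 8 (2 padding elements) is conflict-free only when the XOR-hash is enabled.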
V. HARDWARE DESIGN FOR MERIT
A. The MERIT-z Vector Processor
As shown above, GPUs combined with the MERIT trans-
form framework can greatly simplify the optimization process
for parallel algorithms. However, GPUs also come with units
that introduce unnecessary complexities in the memory hierarchy,
leading to unpredictable performance. In this section, we
use the insights gained from the MERIT transform and create a
clean-sheet ASIC design aimed to remove these complexities.
When done properly, it can outperform GPUs for memory-
intensive vision processing applications while maintaining the
same programming interfaces as the GPUs.
Fig. 7 shows the proposed MERIT-z processor architecture.
It includes a MERIT Memory Management Unit (MMU) using
classic circuit blocks such as the butterfly network to map
common MERIT sub-steps onto the processor efficiently. It
also adopts a fairly standard vector processor design, which
comprises several Tile Accumulation Units (TAUs) and one
Dispatcher. A TAU is analogous to one Streaming Multiprocessor
on GPUs, and is an array of N = 32 ALU cores
in our implementation. Our processor can be configured with
variable numbers of TAUs, such that it can be scaled for both
low-power and high-end computing devices. A TAU consists
of three coarse-grained pipelines, namely the Read Pipeline
(RP), Compute Pipeline (CP), and Write Pipeline (WP). The
RPs cache tiles from the DRAM into the SRAMs, transform
them, and feed them into the ALUs; the WP handles data that need
to be written to the DRAM. Together, the RPs and WP form the
MERIT MMU.
Fig. 8 shows the RP, which consists of multiple SRAM
banks and is designed to perform sub-steps of the MERIT
transform. The RP caches individual tensor tiles from the DRAM,
according to Fig. 5, and performs the sub-step transform
on-the-fly when the CP is consuming the tiles. This involves
reading out the data from N SRAM banks, shuffling, and
feeding the data into N cores. The SRAM acts as a circular
FIFO as shown in Fig. 9. Two-port (TP) SRAMs should be
adequate for this purpose, or single-port (SP) SRAMs can be
used at the cost of a performance drop. When the processor
requests a tile, we allocate its corresponding space on the
SRAM and then mark it as ready after receiving the entire
tile from the DRAM. Afterward, the CP can be launched after
all dependent tiles are ready. Meanwhile, tiles that are no longer
needed are released due to the nature of circular
FIFOs. Because partial tiles are not read out in this design,
7(ii-a)(i)
8-core SIMD
4 52 3
6
0
7
1
2 30 1
4 5 6 7 0 1
2 3 4 5 6 7
4 52 3
6
0
7
1
2 30 1
(ii-b)
4 52 30 1 6 7
2 30 1
2 30 1
2 30 1
2 30 1
2 30 1
4 5 6 7
4 5 6 7
4 5 6 7
4 5 6 7
4 5 6 7
4 52 30 1 6 7 2 30 1
4 5 6 7 4 52 30 1 6 7
4 52 30 1 6 7 2 30 1
4 5 6 7 4 52 30 1 6 7
4 52 30 1 6 7 2 30 1
4 5 6 7 4 52 30 1 6 7
4 52 3
6 7 2 3
4 5 6 7 0 1
2 3 4 5 6 7
4 52 3
6
0
7
1
2 30 1
0 1
0 1
4 52 3
6 7 2 3
4 5 6 7 0 1
2 3 4 5 6 7
4 52 3
6
0
7
1
2 30 1
0 1
0 1
(ii-c) (iv)(iii)
padding
n Pixel resides in SRAM bank nLegend n Dummy pixel resides in SRAM bank n. n Pixel to be processed by n-th ALU.
2 30 1
4 5 6 7
2 30 1
4 5 6 7
2 30 1
4 5 6 7
2 30 1
4 5 6 7
2 3
0 1
4 5
6 7
2 3
0 1
4 5
6 7
bank conflict
Tile division 
scheme
Tile division 
scheme
Tile division 
scheme
Fig. 6. Bank conflict avoidance with MERIT. (i) A 3 × 3 convolution uses a block of 4 × 4 threads with 8 cores, generating a 6 × 6 patch in a 8-bank
shared memory. (ii-a) A naive data layout causes conflict. (ii-b,c) Conflict-free data layouts. (iii,iv) Conflict-free thread grouping with re-tiling technique.
MERIT Processor
TAU
Memory Bus
TAU ...
Dispatcher
CP
SRAM Reg
RP
SRAM
SRAM
SRAM...
RP
SRAM
SRAM
SRAM...
Memory Write
Memory Read
TAU
SIMD
TAU
WP
Fig. 7. The MERIT-z processor architecture.
SRAM
SRAM
SRAM...
Read
Address
Tile
DMA
Tile
DMA Collector
Data
To CP
Address
Generator
Read
Data
Controller
Partial transform 
μ
2
: Expansion
Partial transform 
μ
1
: Tile caching
Fig. 8. The Read Pipeline (RP). It executes sub-steps of a factorized MERIT
transform, including two transforms representing data movement from DRAM
through SRAM to SIMD. The tensors are expanded as late as possible to
minimize memory requirement.
the circuits handling the read-after-write hazards can be greatly
simplified. The total SRAM sizes in the two RPs of a TAU
are 16 and 8 KB, and they can provide enough bandwidth
for many computation tasks. For example, the RPs may hold
the weights and feature maps in a CNN, two input matrices
in GEMM, or the reference and current frames in motion
estimation.
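The allocate, ready, launch, and release steps of the tile buffer can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `TileFifo`, its method names, and the word-granular accounting are hypothetical and are not taken from the MERIT-z hardware, which implements this bookkeeping in circuits.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical model of one RP buffer used as a circular FIFO of whole tiles.
// Sizes are counted in 16-bit words; all names are illustrative only.
class TileFifo {
public:
    explicit TileFifo(std::size_t capacity_words) : capacity_(capacity_words) {}

    // Reserve space for a whole tile at the tail.
    // Returns false (the request has to wait) if the tile does not fit.
    bool allocate(std::size_t tile_words) {
        if (tile_words == 0 || used_ + tile_words > capacity_) return false;
        tiles_.push_back({tail_, tile_words, false});
        tail_ = (tail_ + tile_words) % capacity_;
        used_ += tile_words;
        return true;
    }

    // Mark the oldest tile that is still being filled as ready,
    // i.e. the DMA has delivered the entire tile.
    bool commit() {
        for (Tile& t : tiles_) {
            if (!t.ready) { t.ready = true; return true; }
        }
        return false;
    }

    // The CP may start only when the `needed` oldest tiles are all ready.
    bool can_launch(std::size_t needed) const {
        if (needed > tiles_.size()) return false;
        for (std::size_t i = 0; i < needed; ++i)
            if (!tiles_[i].ready) return false;
        return true;
    }

    // Free the oldest tile as one unit; partial tiles are never released.
    void release() {
        assert(!tiles_.empty() && tiles_.front().ready);
        used_ -= tiles_.front().words;
        tiles_.pop_front();
    }

    std::size_t used_words() const { return used_; }
    std::size_t head() const { return tiles_.empty() ? tail_ : tiles_.front().start; }
    std::size_t tail() const { return tail_; }

private:
    struct Tile { std::size_t start, words; bool ready; };
    std::size_t capacity_, tail_ = 0, used_ = 0;
    std::deque<Tile> tiles_;
};
```

Under these assumptions, the 16 KB RP buffer would be modeled as `TileFifo fifo(8192)`. Because a tile is only ever visible as "not ready" or "ready", the consumer never observes a half-written tile, which is the property the text relies on to simplify hazard handling.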
[Figure 9: tiles stored across the SRAM banks between the head and tail pointers: Tile 1 (in use), Tile 2 (ready), Tile 3 (ready), Tile 4 (partially filled), Tile 5 (not yet allocated).]
Fig. 9. Buffered tiles on SRAM. The SRAM banks are used as one circular
FIFO, and each tile is allocated, committed, and freed as an atomic unit.
The CP accepts data from the RPs, executes the Ranged
Inner-Products, and yields data to the WP. A CP contains a
SIMD array and a 5 KB SP SRAM for storing partial results.
The SIMD ALUs run a 32b ISA that comes with only seven
kinds of instructions (Table I) for 16b fixed-point arithmetic
operations, yet it can support a wide range of applications. The
simplicity of this ISA is enabled by offloading the complexity
of data movement and address calculation to the MERIT
MMU.
TABLE I
THE SIMD ISA OF MERIT-Z PROCESSOR.
Operation type   Functionality               Possible usage(s)
Addition         a+((b+c)>>s)                Common operation.
Subtraction      a+((b-c)>>s)                Common operation.
1-norm           a+(abs(b-c)>>s)             Clustering, motion estimation.
MAC              a+((b*c)>>s)                FC, CONV.
Logical          max/min(a,b)                ReLU, pooling.
                 a ? b : c
                 Binary bitwise ops
Indexing         Load a tensor index         meshgrid, bilateral filter.
Lookup           Lookup with interpolation   Non-linear response in DNN,
                                             bilateral filter, or division.
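To make the arithmetic rows of Table I concrete, one SIMD lane can be written as plain functions. This is a sketch only: the 16-bit operand and 32-bit accumulator widths, the function names, and the `sad` helper are assumptions made for illustration, and the Indexing and Lookup instructions are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Assumed widths: 16-bit operands, 32-bit accumulator (not stated in Table I).
using Word = std::int16_t;
using Acc = std::int32_t;

// Right shift of a negative value is arithmetic on common compilers
// (guaranteed since C++20); this sketch relies on that behavior.
inline Acc op_add(Acc a, Word b, Word c, int s)   { return a + ((Acc(b) + Acc(c)) >> s); }
inline Acc op_sub(Acc a, Word b, Word c, int s)   { return a + ((Acc(b) - Acc(c)) >> s); }
inline Acc op_norm1(Acc a, Word b, Word c, int s) { return a + (std::abs(Acc(b) - Acc(c)) >> s); }
inline Acc op_mac(Acc a, Word b, Word c, int s)   { return a + ((Acc(b) * Acc(c)) >> s); }
inline Acc op_max(Acc a, Acc b)                   { return std::max(a, b); }
inline Acc op_min(Acc a, Acc b)                   { return std::min(a, b); }
inline Acc op_select(bool a, Acc b, Acc c)        { return a ? b : c; }

// ReLU expressed with the logical max instruction.
inline Acc relu(Acc a) { return op_max(a, 0); }

// Sum of absolute differences, the inner loop of motion estimation,
// built from the 1-norm instruction with no shift.
inline Acc sad(const std::vector<Word>& cur, const std::vector<Word>& ref) {
    assert(cur.size() == ref.size());
    Acc acc = 0;
    for (std::size_t i = 0; i < cur.size(); ++i)
        acc = op_norm1(acc, cur[i], ref[i], 0);
    return acc;
}
```

Every arithmetic instruction has the same accumulate-and-shift shape, which is consistent with the text's point that address generation is left entirely to the MMU.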
The WP does not store any data. Instead, it shuffles and
collects data from the CP by generating addresses on-the-fly,
assembling the output lines so that they align with the DRAM or
cache lines. Since writing data does not cause any execution
dependency, a TAU never stalls at the WP unless the DRAM
write queue is full.
TABLE II
A BRIEF SUMMARY OF DNN HARDWARE CONFIGURATIONS.
Hardware          SRAM (KB)   Word size (B)   #ALUs   #ALUs/kword
NVIDIA 1080Ti     3584        4               3584    4.00
Eyeriss [3]       192         2               168     1.76
TPU [2]           28000       1               65536   2.34
DianNao [26]      48          2               256     10.66
MAERI [33]        164         2               168     2.05
Ours (per TAU)    29          2               32      2.20
[Figure 10: pipeline timelines for TAU 0 and TAU 1 in which the reads of Tile 0, Tile 1, and Tile 2 overlap with the Compute+Write stages. Annotations: stall when SRAM is full, warm-up waste, cool-down waste, folding.]
Fig. 10. Overlapping computation and prefetching. The MERIT-z processor
supports overlapped data movement and computation intrinsically. We can
also adopt folding to eliminate wastes.
We select the total sizes of the SP SRAMs in the two RPs and the
CP according to several empirical rules. First, Table II shows
the total SRAM sizes and ALU counts of several popular
DNN architectures. Although these architectures differ widely
in computation ability and datapath, their SRAM-to-ALU
ratios are close, and we select a ratio in line with
them. Second, the two buffers in the RPs are not sharable in our
architecture, and we find that the buffered tensor sizes in the two
RPs are mostly unequal. For example, in motion estimation
or the correlation layer of [16], the buffered tensors of the
searched frame are always larger than those of the current frame.
We thus believe that asymmetric buffer sizes provide better utilization.
Third, as discussed with Fig. 8, the partial sums are stationary
in the CP buffer, while the buffers in the RPs are also used for
prefetching. This suggests that the buffers in the RPs should be a few
times larger than that in the CP. Based on these observations,
we arrive at the 16 KB, 8 KB, and 5 KB configuration.
Overall, a MERIT-z processor has two external interfaces:
a data interface, which uses a valid/ready channel
similar to standard buses like AXI [43] for both address
and data, and a control interface for transmitting the transform
parameters d, s, o in (5) and the program shown in Fig. 4.
Unlike many accelerators that tend to consume tens of KBs
of instruction memory, our processor uses less
than 0.5 KB of registers to define a computation kernel.
While most of the memory-related optimization is encoded
in the MERIT transform, the MERIT-z pipeline can be
further optimized by overlapping computation and memory
fetching as shown in Fig. 10, since these two operations
use different hardware units in a TAU. To further eliminate
bubbles during the warm-up and cool-down stages of the pipeline,
a common technique is to reduce the parallelism by forcing
independent work to run on the same TAU, which is also
known as folding. Instead of creating dedicated hardware
units, these pipelining techniques can be implemented entirely
in software when algorithms are expressed in the MERIT
transform, by merely doubling the width and halving the height of
(M(A), M(B)) and adding an extra for-loop level in the Ranged
Inner-Products.
TABLE III
A COMPARISON OF ALU BANDWIDTH IN ONE PROCESSING PASS OF A
3 × 3 CNN WORKLOAD.
Architecture (shape)   Input feature   Kernel      Output feature   MAC              Reuse rate
Systolic (8x16)
  1 ALU                1               1           1                1                0.33
  Overall              8n              128         16n              128n             5.33
Eyeriss (3x14)
  1 ALU                3x4             16x3x4      16               16x3x4           0.87
  1 Pass               1 ALUx16        1 ALUx3     1 ALUx14         1 ALUx3x14       8.12
  Overall              1 Passxn        1 Pass      1 Passxn         1 Passxn         19.38
MERIT-z (32)           18x10x8         3x3x8x16    0                3x3x8x16x8x16    78.77
B. Architectural Comparison with DNN Accelerators
The MERIT-z processor is more general than dedicated
DNN processors, yet it can achieve comparable or better
performance than processors such as the systolic
array (TPU) or Eyeriss. We attempt to provide some insights
by comparing the dataflow on an architectural level. Fig. 11
illustrates a 3 × 3 CNN workload example,
assuming the following architectures: (1) a 16 × 8 systolic
array, (2) a 12×14 Eyeriss divided into four 3×14 groups (as
described in their paper), and (3) a MERIT-z processor with four
32-ALU TAUs. In Table III, we analyze the data-reuse rate of
these architectures, which is defined as the MAC count divided
by the input and output word count. Among all architectures,
MERIT-z has a significantly higher reuse rate than the others.
This is why the current MERIT-z implementation does not
include an extra memory hierarchy (i.e., global buffer) between
the TAUs and DRAM.
Now we take a closer look at the dataflow of these
architectures. The systolic array is an architecture for matrix
multiplication, so we must use 9 processing passes to compute
the convolution (Fig. 11a). Every cycle, every ALU loads two
inputs, performs a MAC, and delivers one output value to its
neighbor. Overall, each pass inputs an 8 × n matrix and outputs
a 16 × n matrix, performing 128n MACs in total. This gives a
per-ALU reuse rate of 1/(1+1+1) = 0.33 and an overall reuse rate
of 128/(16 + 8) = 5.33. Since the systolic array has no local (i.e.,
per-ALU) storage, the local reuse rate is only 0.33.
[Figure 11: three panels showing the input feature, 16 kernels, and output feature of the convolution on each architecture. (a) Systolic, 8x16 ALUs, no input local buffer (ILB). (b) Eyeriss, 3x13 ALUs, 19.5 KB ILB. (c) MERIT-z, 32 ALUs, 24 KB ILB.]
Fig. 11. Comparison of data reuse in DNN accelerators. Compared with the systolic array, which is for GEMM, accelerators like Eyeriss duplicate data
locally to improve data reuse. The MERIT-z processor distributes a subtensor across local SRAMs and duplicates data through the butterfly network to ensure
high data reuse. The validity of the butterfly network is mathematically guaranteed in the form of a MERIT transform tensor in Section V-C.
[Figure 12: x-axis: #ALUs per TAU (4 to 128); y-axis: Butterfly Area / MAC Area (%), 20 to 50.]
Fig. 12. Area overhead of the butterfly networks. With 32 ALUs in a TAU,
the two 32-to-32 butterfly networks cost
40% of the area of the 32 MAC units.
[Figure 13: x-axis: #16b words in one SP SRAM (64 to 512); y-axis: normalized area (1 to 4); series: 28 nm (m = log2 1.30) and 32 nm (m = log2 1.47).]
Fig. 13. Normalized SRAM area for local buffer. For the local buffers in
most DNN accelerators, a doubled storage capacity only leads to a 1.30-1.47x
area.
Eyeriss is a systolic array variant with extra local storage to
improve the reuse rate. Every Eyeriss ALU can hold up to 3×4
input feature maps, 3 × 4 × 16 kernels, and 16 partial sums,
performing 3×4×16 MACs (Fig. 11b). This almost triples the
per-ALU data reuse. Besides, owing to its
multi-cast network, Eyeriss improves the overall reuse rate by
8.12/5.33 − 1 = 52% with only 42 ALUs and an extra 21 KB of
local storage, compared with the 128-ALU systolic array; this
value can be further boosted to 19.38 by their row stationary
dataflow.
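The systolic and Eyeriss entries of Table III can be re-derived from the word counts with a few lines of arithmetic. The sketch below is illustrative: the `Traffic` struct and function names are hypothetical, and it assumes that the "Overall" rows are the large-n limit, in which the kernel words loaded once per pass sequence become negligible. The MERIT-z row is not recomputed here, since the text does not spell out how its words are counted.

```cpp
#include <cassert>

// Word traffic and MAC count of one row of Table III.
struct Traffic {
    double input, kernel, output, mac;
};

// Reuse rate: MAC count divided by the number of words moved.
inline double reuse_rate(const Traffic& t) {
    return t.mac / (t.input + t.kernel + t.output);
}

// Systolic array, one ALU in one cycle: two loads, one result, one MAC.
inline Traffic systolic_alu() { return {1, 1, 1, 1}; }

// Systolic array, one pass over n columns: 8n inputs, 128 kernel words,
// 16n outputs, 128n MACs.
inline Traffic systolic_overall(double n) { return {8 * n, 128, 16 * n, 128 * n}; }

// One Eyeriss ALU: 3x4 inputs, 16x3x4 kernels, 16 partial sums, 16x3x4 MACs.
inline Traffic eyeriss_alu() { return {3 * 4, 16 * 3 * 4, 16, 16 * 3 * 4}; }

// One Eyeriss pass, scaled from the per-ALU counts as listed in Table III.
inline Traffic eyeriss_pass() {
    const Traffic a = eyeriss_alu();
    return {a.input * 16, a.kernel * 3, a.output * 14, a.mac * 3 * 14};
}

// n passes that keep the same kernels in place (row stationary).
inline Traffic eyeriss_overall(double n) {
    const Traffic p = eyeriss_pass();
    return {p.input * n, p.kernel, p.output * n, p.mac * n};
}
```

For large n, `reuse_rate(systolic_overall(n))` approaches 128/24 and `reuse_rate(eyeriss_overall(n))` approaches 8064/416, which agree with the 5.33 and 19.38 quoted in the text.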
[Figure 14: Row 1 and Row 2 with their rotated copies, connected through a butterfly network; detail of the Chubby Tree with 1x, 2x, and 4x bandwidth levels.]
Fig. 14. Butterfly network used by MAERI. MAERI connects all its ALUs
with a tree-like connection. It distributes data to every local buffer through the
Chubby Tree, which is essentially equivalent to the traditional butterfly network
used by MERIT-z.
An apparent problem of the Eyeriss dataflow is the data
duplication in local storage. While their multi-cast connection
efficiently broadcasts the kernels to a row of ALUs, it also
reduces the equivalent local buffer size by an order of magnitude.
On the other hand, while MERIT-z also uses similar input local buffer
sizes (less than 1 KB each), we aggregate them into a large
one with the butterfly network. We prove in Section V-C that
the butterfly network can provide full throughput for all
supported workloads. The aggregated buffer sizes reach 8–16 KB,
allowing one TAU to hold a smaller but complete part of the CNN
workload (Fig. 11c) with a reuse rate of 78.77, as calculated
in Table III. This
makes MERIT-z more like a no-local-reuse (NLR) architecture
in the Eyeriss taxonomy. NLR architectures read all data from a
larger global buffer and thus achieve a higher ALU density
and data reuse rate. However, their frequent access to the
large global buffers implies high area and power overhead
compared to spatial architectures. Since MERIT-z
uses the butterfly network as its datapath, the overhead is rather
low compared with classic NLR architectures. Fig. 12 shows our
synthesis results: in a TAU with 32 ALUs,
the two butterfly networks cost only 40% of the area of 32 MACs.
Therefore, we can simultaneously benefit from the high data
reuse of the NLR architecture and the power efficiency of
systolic-like architectures. The efficiency of butterfly networks
is also demonstrated in accelerators like MAERI [33], as shown
in Fig. 14. It uses a Chubby Tree to distribute data from
the global buffer to local buffers and collects the partial sums
through a configurable adder tree. The Chubby Tree in MAERI
provides higher flexibility, but it still has data duplication in its
local buffers. Besides, the Chubby Tree and the classic butterfly
network are essentially topologically equivalent.
MERIT-z has several design advantages in its local storage
over the other architectures, and it also has advantages owing
to the simplicity of its memory hierarchy. First, the partial sums
are written to the DRAM only when the summation tasks are
finished, which satisfies the output stationary data reuse, so
the output bandwidth is zero in Table III. Second, since the
ALUs are organized as a SIMD array that executes synchronously,
the ALUs always issue vector reads and writes, and we use
a wide SRAM to reduce the amortized partial sum buffer
area. Last, since we can use the same input buffer for both
prefetching data and supplying data to the SIMD array (Fig. 9), we
can remove the requirement for dedicated FIFOs, which cost
30% of the local buffer area in Eyeriss. For 28–32 nm SP
SRAMs smaller than 1 KB, doubling the storage capacity takes
only 1.30–1.47x the area (Fig. 13), and this explains why we can
use a larger per-ALU input local buffer size (0.75 KB) than
Eyeriss (0.50 KB).
C. Efficient Memory Distribution Circuit with MERIT
In Section IV-B, we mention that each ALU needs an SRAM bank
in order to maximize processor efficiency and avoid bubbles.
A static assignment between ALUs and SRAMs can be too
inflexible, but a crossbar with $\Theta(N^2)$ multiplexers would be
too power-hungry and complex. In our processor, we find
that the classic butterfly network works well with the MERIT
transform while utilizing only $\Theta(N \lg N)$ multiplexers, which
represents only 30% of the area compared to a full crossbar
in our synthesized circuits. In the remainder of the section,
we show how, for computer vision algorithms, a butterfly
network can correctly shuffle elements from SRAM banks to
ALUs without stalls using the MERIT transform.
Given the access patterns shown earlier in Fig. 6, we denote
the SRAM address used by the n-th ALU as $A_n$, and a data
element at address $A_n$ belongs to SRAM bank ($A_n \bmod 8$). In
the following discussion, we show the sufficient conditions
for hardware correctness and efficiency of a MERIT transform
implementation. Specifically, we observe that when mapping
the MERIT transform to SRAM and circuits, no bank conflict can
occur, and a butterfly network can distribute the data to ALUs.
In Fig. 6(ii-a), the addresses accessed by the first purple sub-tile
are $A_{0:7} = (0, 1, 2, 3, 6, 7, 8, 9)$. Since the sub-tile size is
the same as the ALU number, the addresses can be formulated
as:
$$A_n = A_0 + \sum_{i=0}^{2} c_i b_{n,i}, \tag{10}$$
where $b_{n,i}$ is the i-th bit ($b_0$ is the LSB) of the binary
representation of the ALU index n. For instance, for (ii-a), (ii-b),
(iii), and (iv), the $c_{0:2}$ are (1, 2, 6), (1, 2, 12), (1, 2, 12), and
(1, 6, 12), respectively. Equation (11) shows a binary representation of
the addresses when $A_0 = 3$ and $c_{0:2} = (1, 6, 12)$,
$$A_{0:7} \bmod 8 = \begin{bmatrix} 3 & 4 & 1 & 2 & 7 & 0 & 5 & 6 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 4 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 0 & 1 & 1 \end{bmatrix} \equiv \begin{bmatrix} 1 & 2 & 4 \end{bmatrix} S. \tag{11}$$
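As a quick check of (10) and of the bank assignment $A_n \bmod 8$, the addresses of one sub-tile and their bank conflicts can be enumerated directly. The function names below are hypothetical and the sketch assumes non-negative addresses, 8 ALUs, and 8 banks as in Fig. 6.

```cpp
#include <array>
#include <cassert>

constexpr int kAlus = 8;   // ALUs per sub-tile in Fig. 6
constexpr int kBanks = 8;  // SRAM banks

// Equation (10): A_n = A_0 + sum_i c_i * b_{n,i}, where b_{n,i} is bit i of n.
inline std::array<int, kAlus> addresses(int a0, const std::array<int, 3>& c) {
    std::array<int, kAlus> a{};
    for (int n = 0; n < kAlus; ++n) {
        a[n] = a0;
        for (int i = 0; i < 3; ++i)
            if ((n >> i) & 1) a[n] += c[i];
    }
    return a;
}

// True if no two ALUs of the sub-tile access the same bank (A_n mod 8).
inline bool conflict_free(const std::array<int, kAlus>& a) {
    std::array<bool, kBanks> used{};
    for (int addr : a) {
        assert(addr >= 0);
        const int bank = addr % kBanks;
        if (used[bank]) return false;
        used[bank] = true;
    }
    return true;
}
```

With the coefficients quoted in the text, `addresses(0, {1, 2, 6})` gives the conflicting naive layout of (ii-a), while `{1, 2, 12}` and `{1, 6, 12}` give conflict-free sub-tiles.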
Upon careful inspection of this equation, we find that the
matrix S exhibits some special attributes that can be represented
by a hash property matrix H whose elements $h_{i,j}$ can be
computed using the matrix S,
$$h_{i,j} = \begin{cases} 0 & : s_{n,j} = s_{n \wedge 2^i,\,j} \;\; \forall n \\ 1 & : s_{n,j} \neq s_{n \wedge 2^i,\,j} \;\; \forall n \\ x & : \text{otherwise, i.e., } (s_{n,j}, s_{n \wedge 2^i,\,j}) \text{ are not constrained,} \end{cases} \tag{12}$$
where $\wedge$ means the bitwise XOR function, and $n \wedge 2^i$ means
flipping the i-th bit of n. Readers can verify that we obtain
the following property matrix for (11) regardless of the base address
$A_0$,
$$H = \begin{bmatrix} 1 & 0 & 0 \\ x & 1 & 0 \\ x & x & 1 \end{bmatrix}. \tag{13}$$
The positions of the 1s in H are directly related to the power of 2
in the prime factorization of $c_{0:2}$. In the following discussions,
we define the operators on the symbols (0, 1, x) as the standard
operators in ternary logic. Equipped with this hash property
matrix, we assert that a sufficient condition for a MERIT
transform to map to a butterfly network is that the matrix
H can be reduced to the identity matrix I as follows. First,
we select a row that contains only 0s and 1s. Then, we
compute the NOT version of this row and apply it to another
row with the AND operation. The process is repeated until it
produces, or fails to produce, an I. This reduction algorithm
is similar to a standard Gaussian elimination process without
row swapping. For example,
$$H_1 = \begin{bmatrix} 1 & 0 & 0 \\ x & 1 & 0 \\ x & x & 1 \end{bmatrix} \tag{14}$$
represents a compatible address mapping for the butterfly
network, while
$$H_2 = \begin{bmatrix} 1 & 0 & x \\ x & 1 & 0 \\ 0 & x & 1 \end{bmatrix} \tag{15}$$
does not, since we cannot find a row without x to start.
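The property matrix and its reduction can be sketched in code as follows. This is one possible reading rather than a definitive implementation: `property_matrix` scans all base addresses $A_0$ so that the result holds regardless of $A_0$, and it stores address bit j as the row and ALU-index bit i as the column, a layout chosen here because it reproduces (13). The names `Tri`, `property_matrix`, `maps_to_butterfly`, and `example_h2` are hypothetical.

```cpp
#include <array>
#include <cassert>

enum Tri { T0, T1, TX };                  // ternary symbols 0, 1, x
constexpr int kBits = 3;                  // 8 ALUs and 8 banks: 3 bits each
using Strides = std::array<int, kBits>;
using Matrix = std::array<std::array<Tri, kBits>, kBits>;

// Equation (10): address used by ALU n.
inline int address(int a0, const Strides& c, int n) {
    int a = a0;
    for (int i = 0; i < kBits; ++i)
        if ((n >> i) & 1) a += c[i];
    return a;
}

// Property matrix in the sense of (12), combined over all base addresses.
// Assumed layout: row = address bit j, column = ALU-index bit i.
inline Matrix property_matrix(const Strides& c) {
    Matrix h{};
    for (int j = 0; j < kBits; ++j)
        for (int i = 0; i < kBits; ++i) {
            bool same = false, flip = false;
            for (int a0 = 0; a0 < 8; ++a0)
                for (int n = 0; n < 8; ++n) {
                    const int diff = address(a0, c, n) ^ address(a0, c, n ^ (1 << i));
                    if ((diff >> j) & 1) flip = true; else same = true;
                }
            h[j][i] = (same && flip) ? TX : (flip ? T1 : T0);
        }
    return h;
}

// Reduction from the text: pick an unused row without x, AND its negation
// into every other row, repeat, and succeed if the identity remains.
inline bool maps_to_butterfly(Matrix h) {
    std::array<bool, kBits> used{};
    for (int step = 0; step < kBits; ++step) {
        int pivot = -1;
        for (int r = 0; r < kBits && pivot < 0; ++r) {
            bool clean = !used[r];
            for (Tri v : h[r]) clean = clean && (v != TX);
            if (clean) pivot = r;
        }
        if (pivot < 0) return false;      // every remaining row contains x
        used[pivot] = true;
        for (int r = 0; r < kBits; ++r) {
            if (r == pivot) continue;
            for (int col = 0; col < kBits; ++col)
                if (h[pivot][col] == T1) h[r][col] = T0;  // v AND NOT(1) = 0
        }
    }
    for (int r = 0; r < kBits; ++r)
        for (int col = 0; col < kBits; ++col)
            if (h[r][col] != (r == col ? T1 : T0)) return false;
    return true;
}

// The incompatible matrix of (15).
inline Matrix example_h2() { return {{{T1, T0, TX}, {TX, T1, T0}, {T0, TX, T1}}}; }
```

Under these assumptions, `property_matrix({1, 6, 12})` equals the matrix in (13) and reduces to the identity, whereas `example_h2()` is rejected at the first step because no row is free of x.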
In a more complex computation task, such as CNN with
strides or dilated CNN, as shown in Fig. 6(ii-c), H could
become a nonsquare matrix. When this happens, we apply
the following procedure to map a nonsquare matrix H into a
square one H′,
$$H' \equiv RXH = \underbrace{\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}}_{\text{Bit rotate}} \underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}}_{\text{XOR hash}} \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & x \\ 1 & 0 & x \\ x & 1 & x \end{bmatrix} = \begin{bmatrix} 1 & 0 & x \\ x & 1 & x \\ 0 & 0 & 1 \end{bmatrix}, \tag{16}$$
where addition and multiplication are defined as the XOR and
AND operators in ternary logic. Equation (16) shows a
4 × 3 property matrix H created using $c_{0:2} = (4, 8, 3)$. Two
additional matrices, R and X, are introduced to convert it
into a square matrix H′. X is an upper triangular matrix that
trims H into a square matrix, and it has no more than one
off-diagonal term per row. Matrix R is a row permutation matrix
for swapping upper rows to the bottom, and it can be applied
multiple times. Again, readers can verify that the resulting
matrix H′ fulfills the sufficient condition above.
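The product in (16) can be checked mechanically with a small ternary matrix multiply. The sketch below assumes the usual three-valued conventions (0 AND x = 0, 1 AND x = x, and any XOR involving x is x); the helper names and the `example_*` functions, which hold the three factors of (16) with X read as a 3 × 4 matrix, are introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum Tri { T0, T1, TX };                       // ternary symbols 0, 1, x
using Matrix = std::vector<std::vector<Tri>>;

// Ternary AND: 0 dominates; otherwise any x makes the result x.
inline Tri t_and(Tri a, Tri b) {
    if (a == T0 || b == T0) return T0;
    return (a == TX || b == TX) ? TX : T1;
}

// Ternary XOR: x as soon as one operand is x.
inline Tri t_xor(Tri a, Tri b) {
    if (a == TX || b == TX) return TX;
    return (a != b) ? T1 : T0;
}

// Matrix product with XOR as addition and AND as multiplication.
inline Matrix t_mul(const Matrix& a, const Matrix& b) {
    assert(!a.empty() && !b.empty() && a[0].size() == b.size());
    Matrix out(a.size(), std::vector<Tri>(b[0].size(), T0));
    for (std::size_t r = 0; r < a.size(); ++r)
        for (std::size_t c = 0; c < b[0].size(); ++c)
            for (std::size_t k = 0; k < b.size(); ++k)
                out[r][c] = t_xor(out[r][c], t_and(a[r][k], b[k][c]));
    return out;
}

// Factors of (16) for c = (4, 8, 3).
inline Matrix example_h() {
    return {{T0, T0, T1}, {T0, T0, TX}, {T1, T0, TX}, {TX, T1, TX}};
}
inline Matrix example_x() {
    return {{T1, T0, T0, T0}, {T0, T1, T1, T0}, {T0, T0, T1, T1}};
}
inline Matrix example_r() {
    return {{T0, T1, T0}, {T0, T0, T1}, {T1, T0, T0}};
}
```

Evaluating `t_mul(example_r(), t_mul(example_x(), example_h()))` yields the rows (1, 0, x), (x, 1, x), and (0, 0, 1), i.e. the H′ on the right-hand side of (16).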
Since the matrices S, R, and X are all binary matrices, we
can map the math above into pure data shuffle circuits and
place them in the Read Pipeline (Fig. 8). For a 32-core TAU,
the data come into the SRAM through the collector with a
$\log_2 32 = 5$-stage butterfly network, followed by a
$\lceil \log_2 5 \rceil = 3$-stage omega network that implements the
matrices (X, R) above. The data move from the SRAM into the Compute
Pipeline through another $\log_2 32 = 5$-stage butterfly network.
The discussions above imply that we can use classic data
distribution circuit blocks to effectively feed data into the
processors and support a wide variety of workloads.
VI. EVALUATION
A. Code Size Reduction with MERIT
In this experiment, we collect fast kernels from open-source
projects such as Caffe, OpenCV, and Parboil [39], [44], [45],
and rewrite them to demonstrate the code size reduction
achieved with the MERIT transform. Listings 2 and 3 show
an example of the bilateral filter [19] implemented as an
inner-product strategy class, where the Gaussian weights of the spatial
kernels are precomputed as lookup tables. In Loop(), the
neighbor pixels and the two spatial kernels are packed in the array
i. Listing 3 specifies the input as an h-by-w image divided
into 16-by-16 blocks, and each thread within a block performs
a k-by-k local window scan. The middle lines represent the
M(·)s, and the range term σr is passed to the kernel in the
last line.
Listing 2. A strategy class for bilateral filter.
class BilateralStrategy {
    float wsum, wxsum, center;
    struct Constant {float norm;};
    // Note: Function signatures are macros.
    PreLoop(0,1) {wsum = wxsum = 0; center = i[0];}
    Loop(0,3) {
        // i[0]: neighbor pixel, i[1] and i[2]: spatial weights
        float d = center-i[0];
        float w = expf(d*d*c.norm)*i[1]*i[2];
        wsum += w; wxsum += w*i[0];
    }
    PostLoop(1,0) {o[0] = wxsum/wsum;}
};
Listing 3. Bilateral filter using MERIT transform.
InnerProduct<BilateralStrategy>(
    // size and tile size
    {{h,16}, {w,16}}, {{k,k}, {k,k}},
    // d_i, o_i, s_i for input
    {{0,0,1}, {1,0,1}},
    {{0,-k/2,1}, {1,-k/2,1}},
    {h, w}, // h-by-w image
    // d_i, o_i, s_i for spatial weight
    {/*Nothing*/},
    {{0,0,1}, {/*Nothing*/}},
    {k}, // vector of length k
    // More configurations of M()s ...
    BilateralStrategy::Constant{0.1}
);
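For comparison, the per-pixel arithmetic that Listing 2 expresses through PreLoop, Loop, and PostLoop can be written as an ordinary C++ function. This is a reference sketch and not part of the MERIT library: the function name, the row-major image layout, and the clamped border handling are assumptions, since the listings do not show them.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// One output pixel of a bilateral filter on an h-by-w row-major image.
// wy and wx are the precomputed spatial weights (length k each) and norm
// plays the role of Constant::norm in Listing 2.
inline float bilateral_pixel(const std::vector<float>& img, int h, int w,
                             const std::vector<float>& wy,
                             const std::vector<float>& wx,
                             int y, int x, float norm) {
    assert(static_cast<int>(img.size()) == h * w && wy.size() == wx.size());
    const int k = static_cast<int>(wx.size());
    const float center = img[y * w + x];            // PreLoop
    float wsum = 0.f, wxsum = 0.f;
    for (int dy = 0; dy < k; ++dy) {                // Loop over the k-by-k window
        for (int dx = 0; dx < k; ++dx) {
            // Clamp to the image border (assumption, not shown in the listing).
            const int yy = std::min(std::max(y + dy - k / 2, 0), h - 1);
            const int xx = std::min(std::max(x + dx - k / 2, 0), w - 1);
            const float p = img[yy * w + xx];
            const float d = center - p;
            const float wgt = std::exp(d * d * norm) * wy[dy] * wx[dx];
            wsum += wgt;
            wxsum += wgt * p;
        }
    }
    return wxsum / wsum;                            // PostLoop
}
```

All index arithmetic in this version (window offsets, clamping, row-major addressing) is what Listing 3 moves out of the kernel body and into the transform parameters.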
Table IV lists the number of lexical tokens required to
express different algorithms with MERIT, our naive C++
implementation, and fast open-source kernels for the GPUs.
As shown in the table, because code for data movement
is no longer needed in MERIT, we can greatly reduce the
number of both arithmetic operators and identifier tokens, even
compared to the naive CPU implementations.
TABLE IV
CODE TOKEN COUNT COMPARISON.
Kernel                Token type   MERIT on GPU   Naive on CPU   Open-source on GPU
Motion Estimation     Identifier   49             80             194
                      Operator     11             45             106
Bilateral Filter      Identifier   69             87             122
                      Operator     22             49             61
Forward Propagation   Identifier   53             110            204
                      Operator     7              65             117
GEMM                  Identifier   34             34             146
                      Operator     7              20             35
Integral Image        Identifier   24             23             50
                      Operator     6              14             22
Separable Filter      Identifier   35             50             91
                      Operator     5              29             47
TABLE V
SPEEDUP ON GPUS COMPARED TO OPENCV, PARBOIL, AND CAFFE.
Kernels                     Note            Speedup
Separable Filter            k = 3           0.35
                            k = 30          1.42
Motion Estimation                           6.51
Forward Propagation         Ours 3 + 1s     19.9
(kernel size + stride,      Ours 9 + 1s     26.4
32 channels)                Ours 3 + 2s     1.80
                            Ours 9 + 2s     2.83
                            cuDNN 3 + 1s    100
                            cuDNN 9 + 1s    109
                            cuDNN 3 + 2s    27.1
                            cuDNN 9 + 2s    27.3
B. MERIT Transform Performance on GPUs
Table V shows the kernel performance gain from the MERIT
transform. For the separable filter in OpenCV, our approach
is faster except for very small kernel sizes, owing to the
pre-computation overhead. This is particularly interesting because,
despite the simplistic nature of this kernel, we are able to
extract an extra 40% performance gain over OpenCV. The extra
gain comes from address calculations such as buffer[w*x+y]
in the code, which take several instructions while the actual
MAC takes only one. We replace all address calculations with
constant memory lookups in all critical paths (i.e., inner loops),
reducing the instruction count to two. While this technique is
not unique to the MERIT transform, such optimization requires
detailed knowledge of the GPUs, is tedious, and
obfuscates the program. With a unified transform expressing
various vision tasks, all of these tasks can immediately benefit
from parallel computing with one optimized MERIT
implementation.
On the other hand, although there remains a large gap
between MERIT and the heavily tuned cuDNN from NVIDIA,
our implementation still surpasses Caffe, which is built on top
of the heavily optimized cuBLAS linear algebra library, also
from NVIDIA. Also note that our edge over Caffe is larger
for smaller strides and larger kernels, a performance
pattern in line with cuDNN.
TABLE VI
COMPARISON OF DNN ASIC ACCELERATORS; WE CITE THE BACKEND
METRICS FROM [33].
                       Systolic (TPU)   Eyeriss   MAERI   MERIT-z (4-TAU)
Process (nm)           28               28        28      28
#ALUs (MAC)            168              168       168     128
Frequency (MHz)        200              200       200     400
Peak MAC per second    33G              33G       33G     51G
Density (ALU/mm2)      62               25        44      86
Area (mm2)             2.7              6.0       3.8     1.5
Power per MAC (pJ)     5.1              9.0       11.2    7.5
Core Power (mW)        168              304       379     386
TABLE VII
AREA AND POWER BREAKDOWN OF A 4-TAU MERIT-Z.
Module                   Logic (%)   SRAM (%)   Power (%)
Compute Pipeline         15.7        22.0       57
Read Pipelines           16.5        36.5       39
Write Pipeline + Misc.   9.3         0.0        7
Total (absolute)         5.81 mm2 or 4.08M gate count   386 mW
C. MERIT-z Processor ASIC Implementation
In Table VI, we compare MERIT-z with several dedicated
DNN accelerators [3], [33], and Table VII shows the area
and power breakdown of our processor. We synthesize the
4-TAU MERIT-z (128 ALUs) at 400 MHz to provide
computation power similar to these works. Our architecture is
similarly power efficient (per MAC) compared to these processors,
and it uses only about half as much area. Although comparing our
frontend numbers against published backend metrics plays in
our favor, we still compare favorably against these
processors in terms of area and power. Since the current MERIT-z
implementation excludes the global buffer, it has the highest
area efficiency in this table.
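The derived rows of Table VI follow from the raw ones by simple arithmetic, which the sketch below reproduces. It assumes one MAC per ALU per cycle at peak and that the core power is measured at peak throughput; the function names are hypothetical.

```cpp
#include <cassert>

// Peak throughput in MAC/s, assuming one MAC per ALU per cycle.
inline double peak_mac_per_s(int alus, double freq_mhz) {
    return alus * freq_mhz * 1e6;
}

// Energy per MAC in pJ, from the core power in mW at peak throughput.
inline double pj_per_mac(double power_mw, int alus, double freq_mhz) {
    const double watts = power_mw * 1e-3;
    return watts / peak_mac_per_s(alus, freq_mhz) * 1e12;
}
```

For the 4-TAU MERIT-z, 128 ALUs at 400 MHz give 51.2 GMAC/s, and 386 mW over that rate is about 7.5 pJ per MAC; for Eyeriss, 304 mW over 33.6 GMAC/s is about 9.0 pJ, in line with the table.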
These gains come from several factors. First, the MERIT
transform allows us to use the much smaller single-port (SP)
SRAMs, which are only half the size of 2-port (TP)
SRAMs, and this results in an overall area reduction of
over 30% with at most a 3% drop in utilization rate. Second,
as discussed in Section V-B, MERIT-z can achieve a high
utilization rate with a comparatively smaller SRAM storage
because of the efficient MERIT transform application over the
butterfly network and omega network. The simplicity of these
classic circuit networks also allows us to run the chip at a much
higher frequency because of their circuit routing efficiency.
To ensure the evaluation accuracy, we adopt a cycle-accurate
DRAM simulator called Ramulator [46] and integrate it into
our simulation through Verilog VPI. We choose the 1-rank,
2-channel, DDR3 1600, 2Gbx8 configuration with 3.2 GBps bandwidth in the
simulation settings, and the L1 cache size is set to 1 KB, mainly
to cover misaligned DRAM accesses. Also, we synthesize
MERIT-z using the TSMC 28 nm low-power RVT library,
with power simulation reported by Synopsys PrimeTime.
In Table VIII, we show the performance analysis of MERIT-z
running AlexNet [20] and VGG [21]. In addition, we also
benchmark a DNN-based saliency detection algorithm [22]
consisting of additional DECONV layers and traditional image
operations. In Table IX, we further benchmark several DNN
layers developed in state-of-the-art computer vision
conferences [13], [16]–[18]; these layers cannot all be supported
by any single existing architecture, because existing ones are
designed for specific tasks. All of these layers are expressed
with MERIT transforms, and the optimization techniques
mentioned previously can be applied out of the box. For example,
with the help of the optimization techniques shown in Fig. 10, we
can achieve acceptable efficiency for depth-wise convolution
even though this algorithm is mostly memory-bound.
TABLE VIII
DNN WORKLOADS ON MERIT-Z WITH 3.2 GBPS DRAM.
Layer                      Power (mW)   Utilization
AlexNet (overall) [20]     386          0.88
  CONV1+Pool2              334          0.88
  CONV2+Pool2              392          0.95
  CONV3                    363          0.77
  CONV4                    363          0.72
  CONV5                    363          0.72
  FC1-3 (batch=32)         360          0.92
VGG-16D (overall) [21]     355          0.85
  CONV1                    300          0.89
  CONV2-4+Pool2            363          0.95
  CONV5-13+Pool2           363          0.83
  FC1-3 (batch=32)         360          0.92
Saliency (overall) [22]    377          0.77
  CONV1+Pool3              412          0.72
  CONV2+Pool3              428          0.82
  CONV3+Pool3              420          0.65
  DECONV4                  292          0.87
  DECONV5+threshold        302          0.71
  Alpha blending           64           0.07
TABLE IX
MERIT-Z EFFICIENCY FOR POPULAR WORKLOADS.
Computation                   Output tensor size    Utilization
Dilated [13]                  64 × 64 × 32          0.95
Pixel Shuffle (ESPCN) [18]    32 × 32 × 32          0.96
Correlation (FlowNet) [16]    64 × 64 × 11 × 11     0.74
Depthwise (MobileNet) [17]    64 × 64 × 32          0.63
GEMM (FC, RNN)                256 × 128             0.92
Motion estimation             720p, 8 × 8 block     0.74
D. Scaling up MERIT-z Processor
The vector-architecture nature of MERIT-z means that the
area scales up fairly linearly when adding more TAUs
to MERIT-z. On the other hand, Fig. 15 shows the relative
speedup with ALU counts between 32 and 1024, using the
same DDR3 3.2 GBps simulation setting as before. Most
algorithms, except for the depth-wise convolution, scale well up
to 256 ALUs, beyond which the applications become entirely
DRAM memory-bound. Since MERIT-z only utilizes one
level of the memory hierarchy, we eliminate the need for
moving data from the global prefetch buffer to local buffers,
which can be a bottleneck as reported by [3]. Therefore, for DNN
layers requiring multiple processing passes, like VGG CONV1, we
achieve notably higher utilization than other works, and
such a feature is essential for larger DNN workloads that are
likely to appear in the future.
VII. CONCLUSION
In this paper, we proposed the MERIT transform, a
mathematical framework for transforming vision processing tasks
into SIMD-friendly workloads by separating data movement
from computations. Since the core of algorithm optimization
often lies in the data movement process, it allows us to write
fast parallel kernels with very few lines of code. Because
this process is similar across different algorithms on the same
processor, we created a library for CUDA GPUs so as to free
programmers from the burden of repeated optimization efforts,
such as thread tiling and shared memory access. We also used
these insights to design the MERIT-z processor. The processor
can perform a wide range of applications, such as DNN, image
processing, and traditional machine learning applications, with
comparable area and power efficiency to several dedicated
DNN processors. It uses classic circuit blocks such as butterfly
networks to shuffle data to the ALUs efficiently. Also, we
have released both the CUDA library and MERIT-z processor
implementations under the open-source GPL license.
The mathematical framework has proven to be a useful
utility for verifying the efficiency of algorithms against
processors. We shall continue to identify more useful properties of
the MERIT transform in order to exploit opportunities for data
reuse between processors during execution. We would also like
to look into more matching patterns between transforms and
circuits and to provide more automatic means for generating these
patterns, so that the framework can become more powerful and useful
for the parallel and scientific computing community.
ACKNOWLEDGMENT
This work was partially sponsored by MediaTek Inc., Hsin-
chu, Taiwan and MultiTek Inc., Hsin-chu, Taiwan.
APPENDIX
The CUDA implementation of MERIT can be found
on GitHub under mediaic/UMI, and the SystemVerilog
implementation will soon be available under mediaic/MERIT.
REFERENCES
[1] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran,
B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for
deep learning,” CoRR, vol. abs/1410.0759, 2014. [Online]. Available:
http://arxiv.org/abs/1410.0759
[2] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor
processing unit,” in Proceedings of the 44th Annual International
Symposium on Computer Architecture, ISCA 2017, Toronto, ON,
Canada, June 24-28, 2017, 2017, pp. 1–12. [Online]. Available:
http://doi.acm.org/10.1145/3079856.3080246
[3] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture
for energy-efficient dataflow for convolutional neural networks,” in
2016 ACM/IEEE 43rd Annual International Symposium on Computer
Architecture (ISCA), June 2016, pp. 367–379.
[4] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou,
and Y. Chen, “Pudiannao: A polyvalent machine learning accelerator,”
SIGARCH Comput. Archit. News, vol. 43, no. 1, pp. 369–381, Mar.
2015. [Online]. Available: http://doi.acm.org/10.1145/2786763.2694358
[5] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net:
Imagenet classification using binary convolutional neural networks,”
CoRR, vol. abs/1603.05279, 2016. [Online]. Available:
http://arxiv.org/abs/1603.05279
[6] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal, W. Constable,
O. Elibol, S. Hall, L. Hornof, A. Khosrowshahi, C. Kloss,
R. J. Pai, and N. Rao, “Flexpoint: An adaptive numerical format
for efficient training of deep neural networks,” in Advances in
Neural Information Processing Systems 30: Annual Conference on
Neural Information Processing Systems 2017, 4-9 December 2017,
Long Beach, CA, USA, 2017, pp. 1740–1750. [Online]. Available:
http://papers.nips.cc/paper/6771-flexpoint-an-adaptive-numerical-format-for-efficient-training-of-deep-neural-networks
[7] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized
convolutional neural networks for mobile devices,” in 2016 IEEE
Conference on Computer Vision and Pattern Recognition, CVPR 2016,
Las Vegas, NV, USA, June 27-30, 2016, 2016, pp. 4820–4828. [Online].
Available: https://doi.org/10.1109/CVPR.2016.521
[8] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
W. J. Dally, “EIE: efficient inference engine on compressed deep
neural network,” CoRR, vol. abs/1602.01528, 2016. [Online]. Available:
http://arxiv.org/abs/1602.01528
[9] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-
normalizing neural networks,” CoRR, vol. abs/1706.02515, 2017.
[Online]. Available: http://arxiv.org/abs/1706.02515
[10] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” CoRR,
vol. abs/1502.03167, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03167
[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in 2017
IEEE International Conference on Computer Vision (ICCV), Oct 2017,
pp. 2980–2988.
[12] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for
semantic segmentation,” CoRR, vol. abs/1505.04366, 2015. [Online].
Available: http://arxiv.org/abs/1505.04366
[13] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated
convolutions,” in ICLR, 2016.
[14] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable
convolutional networks,” in 2017 IEEE International Conference on
Computer Vision (ICCV), Oct 2017, pp. 764–773.
[15] M. Buckler, P. Bedoukian, S. Jayasuriya, and A. Sampson, “EVA²:
Exploiting temporal redundancy in live computer vision,” CoRR,
vol. abs/1803.06312, 2018. [Online]. Available:
http://arxiv.org/abs/1803.06312
[16] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov,
P. v.d. Smagt, D. Cremers, and T. Brox, “FlowNet: Learning
optical flow with convolutional networks,” in IEEE International
Conference on Computer Vision (ICCV), 2015. [Online]. Available:
http://lmb.informatik.uni-freiburg.de/Publications/2015/DFIB15
[17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand,
M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural
networks for mobile vision applications,” CoRR, vol. abs/1704.04861,
2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[18] W. Shi, J. Caballero, F. Huszar, J. Totz, A. P. Aitken, R. Bishop,
D. Rueckert, and Z. Wang, “Real-time single image and video super-
resolution using an efficient sub-pixel convolutional neural network,”
in The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2016.
[19] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color
images,” in Sixth International Conference on Computer Vision (IEEE
Cat. No.98CH36271), Jan 1998, pp. 839–846.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and
K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” in International Conference on Learning
Representations, 2015.
[22] S.-Y. Wu, Y.-S. Lin, W.-C. Tu, and S.-Y. Chien, “Hardware-efficient
two-stage saliency detection,” in IEEE International Workshop on Signal
Processing Systems (SiPS), 2018.
[Figure 15: plot. Workloads: GEMM; AlexNet CONV1; AlexNet CONV2; AlexNet CONV3-5 & VGG CONV2-13; VGG CONV1; Dilated; Pixel shuffle; Correlation; Motion estimation; Depthwise. Vertical axis: Utilization Rate. Series: Eyeriss, 1 TAU, 2 TAU, 4 TAU, 8 TAU, 16 TAU, 32 TAU.]
Fig. 15. Utilization scaling of MERIT-z processor. MERIT-z remains highly utilized among a variety of workloads up to 256 ALUs, and for comparison
we also calculate and plot the utilization rate of Eyeriss according to their processing latency.
[23] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and
S. Amarasinghe, “Halide: A language and compiler for optimizing
parallelism, locality, and recomputation in image processing pipelines,”
SIGPLAN Not., vol. 48, no. 6, pp. 519–530, Jun. 2013. [Online].
Available: http://doi.acm.org/10.1145/2499370.2462176
[24] Y. W. Hu et al. Optimize deep learning GPU operators
with TVM: A depthwise convolution example. [Online]. Available:
http://tvmlang.org/2017/08/22/Optimize-Deep-Learning-GPU-Operators-with-TVM-A-Depthwise-Convolution-Example.html
[25] N. Weber and M. Goesele, “Adaptive GPU array layout auto-tuning,” in
Proceedings of the ACM Workshop on Software Engineering Methods for
Parallel and High Performance Applications. ACM, August 2016, pp.
21–28. [Online]. Available: http://tubiblio.ulb.tu-darmstadt.de/82600/
[26] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam,
“DianNao: A small-footprint high-throughput accelerator for ubiquitous
machine-learning,” SIGPLAN Not., vol. 49, no. 4, pp. 269–284, Feb.
2014. [Online]. Available: http://doi.acm.org/10.1145/2644865.2541967
[27] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on
large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[Online]. Available: http://doi.acm.org/10.1145/1327452.1327492
[28] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis,
and M. A. Horowitz, “Convolution engine: Balancing efficiency
and flexibility in specialized computing,” SIGARCH Comput. Archit.
News, vol. 41, no. 3, pp. 24–35, Jun. 2013. [Online]. Available:
http://doi.acm.org/10.1145/2508148.2485925
[29] Y. S. Lin, W. C. Chen, and S. Y. Chien, “Unrolled memory inner-products:
An abstract GPU operator for efficient vision-related computations,”
in 2017 IEEE International Conference on Computer Vision
(ICCV), Oct 2017, pp. 4587–4595.
[30] C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean,
G. S. Rose, and J. S. Plank, “A survey of neuromorphic computing
and neural networks in hardware,” CoRR, vol. abs/1705.06963, 2017.
[Online]. Available: http://arxiv.org/abs/1705.06963
[31] F.-B. Tu. Neural networks on silicon. [Online]. Available:
https://github.com/fengbintu/Neural-Networks-on-Silicon
[32] D. Shin, J. Lee, J. Lee, and H. J. Yoo, “14.2 DNPU: An 8.1TOPS/W
reconfigurable CNN-RNN processor for general-purpose deep neural
networks,” in 2017 IEEE International Solid-State Circuits Conference
(ISSCC), Feb 2017, pp. 240–241.
[33] H. Kwon, A. Samajdar, and T. Krishna, “MAERI: Enabling
flexible dataflow mapping over DNN accelerators via reconfigurable
interconnects,” in Proceedings of the Twenty-Third International
Conference on Architectural Support for Programming Languages and
Operating Systems, ser. ASPLOS ’18. New York, NY, USA: ACM,
2018, pp. 461–475. [Online]. Available:
http://doi.acm.org/10.1145/3173162.3173176
[34] J. Hegarty, J. Brunhaver, Z. DeVito, J. Ragan-Kelley, N. Cohen, S. Bell,
A. Vasilyev, M. Horowitz, and P. Hanrahan, “Darkroom: Compiling
high-level image processing code into hardware pipelines,” ACM Trans.
Graph., vol. 33, no. 4, pp. 144:1–144:11, Jul. 2014. [Online]. Available:
http://doi.acm.org/10.1145/2601097.2601174
[35] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides,
J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman,
S. Hauck, S. Heil, A. Hormati, J. Y. Kim, S. Lanka, J. Larus,
E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, “A
reconfigurable fabric for accelerating large-scale datacenter services,” in
2014 ACM/IEEE 41st International Symposium on Computer Architecture
(ISCA), June 2014, pp. 13–24.
[36] J. C. Chen and S. Y. Chien, “CRISP: Coarse-grained reconfigurable
image stream processor for digital still cameras and camcorders,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 18,
no. 9, pp. 1223–1236, Sept 2008.
[37] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally,
E. Lindholm, and K. Skadron, “Energy-efficient mechanisms for
managing thread context in throughput processors,” in Proceedings of
the 38th Annual International Symposium on Computer Architecture,
ser. ISCA ’11. New York, NY, USA: ACM, 2011, pp. 235–246.
[Online]. Available: http://doi.acm.org/10.1145/2000064.2000093
[38] K. Wang and C. Lin, “Decoupled affine computation for SIMT GPUs,” in
Proceedings of the 44th Annual International Symposium on Computer
Architecture, ser. ISCA ’17. New York, NY, USA: ACM, 2017, pp. 295–306.
[Online]. Available: http://doi.acm.org/10.1145/3079856.3080205
[39] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” arXiv preprint arXiv:1408.5093, 2014.
[40] H. Nguyen, GPU Gems 3, 1st ed. Addison-Wesley Professional, 2007.
[41] C. Gou, G. K. Kuzmanov, and G. N. Gaydadjiev, “SAMS: Single-affiliation
multiple-stride parallel memory scheme,” in Proceedings of
the 2008 Workshop on Memory Access on Future Processors: A Solved
Problem?, ser. MAW ’08. New York, NY, USA: ACM, 2008, pp. 350–368.
[Online]. Available: http://doi.acm.org/10.1145/1366219.1366220
[42] D. T. Harper and D. A. Linebarger, “Conflict-free vector access using
a dynamic storage scheme,” IEEE Transactions on Computers, vol. 40,
no. 3, pp. 276–283, Mar 1991.
[43] ARM. The official document of the AXI protocol. [Online]. Available:
http://infocenter.arm.com/help/index.jsp
[44] G. R. Bradski and A. Kaehler, Learning OpenCV, 1st ed.
O’Reilly Media, Inc., 2008.
[45] J. A. Stratton, C. Rodrigues, I. J. Sung, N. Obeid, L. W. Chang,
N. Anssari, G. D. Liu, and W. W. Hwu, “Parboil: A revised benchmark
suite for scientific and commercial throughput computing,” Center for
Reliable and High-Performance Computing, 2012.
[46] Y. Kim, W. Yang, and O. Mutlu, “Ramulator: A fast and extensible
DRAM simulator,” IEEE Computer Architecture Letters, vol. 15, no. 1,
pp. 45–49, Jan 2016.
