Computationally intensive deep neural networks (DNNs) are well-suited to run on GPUs, but newly developed algorithms usually require the heavily optimized DNN routines to work efficiently, and this problem could be even more difficult for specialized DNN architectures. In this article, we propose a mathematical formulation that can be useful for transferring the algorithm optimization knowledge across computing platforms. We discover that data movement and storage inside parallel processor architectures can be viewed as tensor transforms across memory hierarchies, making it possible to describe many memory optimization techniques mathematically. Such transform, which we call memory-efficient ranged inner-product tensor (MERIT) transform, can be applied to not only DNN tasks but also many traditional machine learning and computer vision computations. Moreover, the tensor transforms can be readily mapped to existing vector processor architectures. In this article, we demonstrate that many popular applications can be converted to a succinct MERIT notation on GPUs, speeding up GPU kernels up to 20 times while using only half as many code tokens. We also use the principle of the proposed transform to design a specialized hardware unit called MERIT-z processor. This processor can be applied to a variety of DNN tasks as well as other computer vision tasks while providing comparable area and power efficiency to dedicated DNN application-specific integrated circuits (ASICs).
I. INTRODUCTION
R ECENTLY, deep learning technology has gained remarkable success in many fields such as computer vision, image processing, and natural language processing. The computations of deep neural networks (DNNs) tend to be fairly regular, and this means they are well-suited for highly parallel processors such as general-purpose graphics processing units (GPGPUs). While GPU programming languages like CUDA and OpenCL have reached maturity and shown their promises in many scientific fields, it still requires specialized knowledge in GPU architectures in order to implement DNN algorithms efficiently. In some cases, libraries are implemented with insider knowledge that is not available to the public, such as cuDNN from NVIDIA [1] . Manuscript For the purpose of efficient DNN computation, a recent trend is to create application-specific integrated circuits (ASICs), as they can often achieve lower power consumption or higher performance compared to general processors. Popular designs include systolic arrays due to its ability to directly exchange data between processors [2] , [3] and GPU-like vector architectures with reduced arithmetic precision and larger on-chip memory [4] . For extremely lowpower devices, ASICs can also help reduce memory pressure by processing compressed neural networks. These techniques include binarization [5] , [6] , dictionary-based compression [7] , and pruning-based entropy compression [8] . While ASICs are more power efficient than GPUs, they are not flexible enough to keep up with the evolution of DNN techniques [9] - [18] . The new types of computations can limit the widespread adoption of DNN ASICs since the arithmetic units and memory organization tend to be highly optimized for specific types of networks, which makes it difficult or impractical to map these new network layers to ASICs.
Even though the processors come in different varieties, the principles for optimizing data-intensive algorithms are always the same-bring data closer to the processing elements. For this purpose, DNN processors often do not employ traditional memory hierarchies. Instead, programmers are given the freedom to explicitly control data movement between different types of memories, such as local, on-chip, or off-chip memory buffers. Unfortunately, this means each algorithm needs to be optimized for each new processor by redesigning a new data movement procedure, causing repeated and wasted efforts.
To solve this problem, we propose the memory-efficient ranged inner-product tensor (MERIT) transform which can incorporate data movement patterns into algorithms, as shown in Fig. 1(a) . A MERIT transform converts one matrix (or generally, one tensor) into another by performing permutation and repetition of the matrix elements. This allows us to convert computational tasks into vector-wise products or reductions, as shown in Fig. 1(b ). An efficient MERIT transform therefore conforms to the steps of data journey through the memory hierarchy, and each step performs a partial transform representing a data movement procedure, such as tiling, shuffling, or bank conflict avoidance, as shown in Fig. 1(c) . These partial transforms correspond to common resource constraints of GPUs or DNN architectures, such as limited on-chip SRAM or partially connected data networks. Therefore, when a general vision-related computation can be reduced to a MERIT transform, many memory optimization techniques can be implemented on a particular architecture with very little effort.
To demonstrate the usefulness of MERIT, we use it to optimize several algorithms on GPUs and ASICs, including bilateral filter [19] , motion estimation, and many DNN layers [13] , [16] - [18] , [20] - [22] . For GPUs, we implement MERIT for CUDA as a header file, which can be easily integrated into existing algorithms to provide higher performance than many hand-tuned implementations. The simplicity of the MERIT representation, and its strong connection between mathematics and architecture, means that we can also use it to create hardware with relatively little cost and effort. We adopt this approach to design an open-source ASIC processor, the MERIT-z, that uses classic, efficient circuits, such as butterfly networks, to achieve efficient DNN and computer vision performance.
Specifically, our contributions are as follows. 1) MERIT transform: A data transform methodology that can succinctly describe data movement and reduce optimization efforts in parallel processing. 2) MERIT for CUDA: An open-source API that produces fast GPU kernels with fewer code tokens compared to näive CPU implementations.
3) MERIT-z processor:
An open-source general vector processor designed with insights gained from the MERIT transform, supporting both common DNN layers and traditional vision processing.
In Section II, we build up the background knowledge for readers. In Section III, we formally define the MERIT transform and illustrate how to express different vision workloads with the transform. Based on such transform definition, in Sections IV and V, we show how to map parallel workloads to an efficient CUDA execution and onto the MERIT-z processor. Finally, we show the results of the experiments and sum up this article.
II. RELATED WORKS

A. Kernel Transform for Parallel Program Optimization
For parallel architectures, the performance of a computation task is usually limited by the memory bandwidth to the computing units. While there are many common principles for memory optimization, it has never been easy to optimize an algorithm on a particular parallel architecture, and there has been extensive research to tackle this problem.
One approach is to design a new, domain-specified programming language (DSL) that can describe applications in specific domains, such as Halide [23] and a tensor virtual machine (TVM) [24] . An algorithm written in DSLs usually requires fewer lines of code than those in generic languages like C++ or OpenCL. DSLs can also be compiled to target specific types of hardware, and the process often involves searching for parameters for different memory configurations like [25] . This means the design of DSL compilers and their searching strategies require significant efforts, and their language features are less stable because they can evolve more rapidly compared to generic languages. Some DSLs also involve advanced programming concepts, making it difficult to appeal to the wider audience. For example, Halide defines images as N-dimension functions and describes the image pipelines as functional compositions. In TVM, the computation schedule is stored in symbolic variables for programmers could use to build algorithms.
Another simpler approach is to convert a new computation problem into an existing one. For example, a common implementation for the CONV layer in deep learning on GPUs is to unroll, or flatten, patches of an image to columns of a matrix. This transforms the CONV layer into a GEneral Matrix-Matrix multiplication (GEMM), a routine that has been heavily optimized in existing libraries such as cuBLAS. This technique has also been applied to ASIC design in order to support CONV on GEMM architectures [2] , [26] . Conversionbased methods are easier to implement since programmers only need to write the conversion routine to take advantage of the well-optimized routines. However, not every problem can be easily converted, and the process may sometimes introduce significant overhead which shall be discussed further in Section III.
The third approach is to divide algorithms into combinations of interchangeable function blocks. Programmers implement algorithms by connecting different blocks without having to understand how to actually schedule the computation. For example, in a GPU, programmers write the vertex and fragment shaders to describe object transform and color shading rules, and the 3-D objects are rendered to the 2-D screen automatically by the processor. MapReduce [27] decomposes large-scale text-processing tasks into data sorting and two custom functions, map and reduce. Convolution Engine [28] is a MapReduce-based streaming image processor with a restricted map and reduce operators. UMI Operator [29] shows several vision-related computations that can be described by a custom unrolling procedure and a generalized inner-product function.
Later in this article, we show that, for many vision-related kernels or algorithms, we can decompose their data movement processes such that algorithms can run on simple SIMD architectures. This transform is related to the aforementioned unrolling and can be defined with a few parameters for the data network hardware to execute efficiently. We also propose a methodology for mapping this transform to GPUs and ASICs, which can greatly reduce the optimization effort for parallel programs.
B. Hardware for Image, Vision, and DNN
Here, we review several DNN, machine-learning accelerators, and vision processors published in mainstream conferences for graphics, computer architecture, and circuit technology. For a more comprehensive review of DNN accelerators, we recommend the readers to consult survey articles such as [30] and [31] .
Dedicated DNN ASICs are usually designed for speeding up either FC or CONV layers, but they can also support other types of layers via conversion. For example, DianNao [26] uses a (16 × 16)-by-(16) matrix-vector processor to compute both FC and CONV layers, and it needs to unroll the feature maps in the input buffers in order to evaluate the CONV layers. A tensor processing unit (TPU) [2] uses systolic arrays and is able to perform a variable (256 × 256)-by-(256 × N) matrixmatrix operation, and it supports CONV layers by storing the unrolled feature maps within 256 separate queues. Eyeriss [3] holds a row in the processor and adds a path to broadcast rows for the CONV layers, but this path is not utilized by the FC layers. In a more recent work [32] , separate hardware units are used to process FC and CONV layers independently. MAERI [33] adopts a tree-based distribution and reduction path to support more types of DNN layers, but it can be difficult to configure these trees.
There exist ASICs designs that aim to support more general parallel computation tasks. Convolution Engine [28] and PuDianNao [4] use the fixed function pipeline architecture and support different types of computations by enabling different functional units. Reconfigurable interconnection is yet another example, where field-programmable gate arrays (FPGAs) are dynamically programmed to execute various computing tasks [34] , [35] . In a coarse-grained reconfigurable image stream processor (CRISP) [36] , different functional units are connected by a programmable crossbar, and various image processing methods could be applied as image pixels are streamed into the processor.
Register File Cache prevents the round trip between processors and the SRAM, improving the overall throughput and reducing the power consumption [37] . Affine Warp detects the regular register values across parallel threads, minimizing the redundancy of computing and storing these registers [38] . The design of our MERIT-z processor is related to these GPGPU optimization research, except that our architecture is derived through insights from the mathematical framework presented in this article.
III. MERIT TRANSFORM
The MERIT transform, denoted as M(·), is a mathematical process for converting a tensor into another, such that a given algorithm can be transformed into the SIMD domain for easier operation.
As an example, we can apply the MERIT transform to the GEMM algorithm C = AB and obtain
where R(X, Y, ) means applying dot-product to every row of (X, Y), Vec(C) is the vectorized form of matrix C, and the MERIT transforms M(A) and M(B) are created by repeating rows or columns of the input matrices (see Fig. 2 ). While this formulation appears to complicate matters, there are several computational advantages to this expression. First, each dotproduct can be computed independently, which means this equation can be evaluated in parallel. Second, M(·) and can both be specified by the users, making this equation configurable for general computational tasks. Third, elements in the transformed matrices are copies of the original ones, which means M(·) is a pure data movement operation, leaving as the only real arithmetic operations in the equation. Because the MERIT transform M(·) implies data movement, it is important to make sure M(·) maps to actual hardware to achieve optimal performance. For this purpose, we factorize the MERIT transform M(A) = μ n · · · • μ 2 • μ 1 (A), where each partial transform, or substep μ i , is a repetition and permutation on the input data [see Fig. 1 
For a particular architecture, we would need a particular set of substeps to reflect its memory and data-path design such as multibank SRAM or DRAM. Fortunately, these substeps can often be reused for the same architecture across different algorithms, and we will discuss the subject of efficient MERIT transform in Section IV. Before then, let us start with a few more examples to familiarize the readers with the benefits of this transform. Fig. 3 (a) shows a popular technique for accelerating convolution on SIMD processors [39] . This process converts convolution into GEneral Matrix-Vector multiplications (GEMVs) by unrolling the input image A and vectorizing the kernel B, which can be written as
A. Convolution on SIMD Processors
This is very effective because it takes advantage of the welloptimized GEMV routines on SIMD processors. However, unrolling creates an unnecessarily large input matrix U (A), which is clearly suboptimal. A more optimal process would be to replicate data on-the-fly. This means we could assign M(A) ≡ U (A), and expand the data during the transfer process, as shown in Fig. 1(c) . Rather than expanding the image before the input stage, M(·) expands during data movement, such that replication of data can occur as late as possible to reduce memory bandwidth and buffer size. Note that in order to map the aforementioned equation to (1), M(·) also needs to be applied to the kernel B [see Fig. 3 (b)], but this is efficient and also done on-the-fly.
B. Extending to General Tensor Product
A MERIT transform is coupled with a product function such that (1) represents an algorithm and can run on SIMD efficiently. We can change this function to adapt the equation to different algorithms. For the MERIT GEMM and convolution, we apply the dot-product to each vector pair of the transformed matrices. Similarly, if we define
then we apply our transform to more computations, such as fused ReLU layer or patch distance algorithms like motion estimation. If we couple such vector products to MERIT transform, then we in fact apply vector functions to flattened tensors from 2-D image patches or 3-D convolution kernels. Instead of defining the function on vectorized tensors Vec(T 1 ) Vec(T 2 ), a more natural way is to define a tensor product that computes T 1 T 2 directly. Our proposal for a generalized tensor product, namely, Ranged Inner-Product, is related to the ideas from [28] and [29] , where a vector product is expressed as a strategy class, one of the most common design patterns. An implementation of the strategybased vector product is listed in Listing 1.
Here, programmers can write the ReLU product as a class consisting of three strategy functions and a for-loop. These strategy functions perform partial sum initialization, accumulation, and clipping, respectively. Such a strategy class, when combined with MERIT transform, provides high flexibility for expressing various workloads. For example, if we load another tensor and add it to act in PostLoop, we then obtain an inner-product function for residual layer.
Extending the idea of strategy beyond vectors, programmers can use a more complex class and nested for-loops to implement a 2-D Ranged Inner-Product, which is a 2 × 2 for-loop. As shown in Fig. 4(a) , For 3-D or higher dimensional tensors, however, this would not be suitable because of the resulting deeply nested for-loops, and therefore we propose to linearize the nested loops. Fig. 4(b) shows the function sequences being executed for the 2-D case, where each index corresponds to a certain execution order of the functions. Fig. 4 (c) and (d) shows the efficient ranged inner-product implementation. First, the strategy functions are concatenated to form a single program, and their starting and ending addresses are stored as tables. Given the loop indices (i.e., i, j), the starting and ending addresses are selected from the table which decides the function range that should be executed for those indices. To support higher dimension execution, we can simply enlarge the address tables and implement the selection using parallel prefix sum.
C. Putting It Together
In summary, the MERIT transform alters an input tensor into another tensor, for the purpose of converting parallel algorithms into simpler parallel tensor reductions, or the ranged inner products. This means that in the previous examples [see is based on splitting the tensor indices into two parts. The row indices p reflect the algorithm parallelism, and column indices a reflect the number of accumulated elements in a tensor product. In these figures, the rows and columns of M(A) and M(B) are assigned with grid indices like (0, 0), (0, 1), (0, 2), (1, 0), . . . . We then use the notation M(A) p,a to refer to one element in a transformed tensor, and for convenience, we denote the concatenated indices as k = (p, a). Changing k j , the j th component of k, which belongs to either p or a, is equivalent to moving along a particular axis d j of A with a certain stride s j and offset o j
where δ is the Kronecker delta. The aforementioned equation first calculates the indices x through (p, a) and parameters d j , s j , and o j , then x is used to locate a specific element
Based on the aforementioned discussions, a MERIT transformed tensor should reflect the total complexity and available parallelism of a computation. For example, an (m, k) × (k, n) GEMM problem has a total complexity of (mnk) with (mn) available parallelism, and the sizes of the transformed tensors are [(m, n), (k)]. Similarly, a (k × k)-convolution problem on an (h × w) image has a total complexity of (hwk 2 ) with (hw) available parallelism, and the sizes of the transformed tensors are [(h, w), (k, k)].
We show an example of the MERIT transform of AlexNet CONV1. This layer adopts a stride size 4 and a 11 × 11 kernel size, so it can be written as where I and k are the input feature map and convolution kernel, respectively. Note that here we implicitly define the x i of (5) in the subscripts of the right-hand side expression. Apart from CNN with stride, we can express several popular CNN variants with MERIT transform definition. For example, the dilated CNN [13] , which can effectively increase the receptive field, is defined by
The correlation layer [16] is a CNN layer for optical flow. It simulates the traditional motion estimation computation and can be expressed with
IV. EFFICIENT MERIT TRANSFORM ON GPUs
Modern GPUs have SIMD-like designs that are suitable for workloads targeted by the MERIT transform. However, because of the intricate design of GPU memory hierarchy, care must be taken while fetching data, as the latency introduced by cache misses tends to be fairly high. Traditionally, in order to maximize data locality, programmers are responsible for dividing workloads into independent subtasks and plan the data fetching accordingly. These very tedious tiling or blocking optimization processes can be replaced by factorized MERIT transforms, as shall be explained in the remainder of the section.
A. Tiled Parallel Execution
With the MERIT transform, supporting tiled execution can be simply achieved by dividing M(A) and M(B) into subtensors of sizes (t p , t a ). Fig. 5 highlights some subtensors from Figs. 2 and 3(b) with (t p , t a ) = ((2, 2), (1, 2)) and ((2, 2), (2)), respectively. It can be observed that for an arbitrary tensor A and its transformed tensor M(A), each subtensor of M(A) can be fully contained by a minimal subtensor of A, whose size at the i th dimension is calculated by
where d j and s j represent the same axes and strides as in (5), and t j denotes the size at the j th dimension of the subtensor in the M(A) domain. This allows us to compute the memory footprint of the subtensor. For example, using a 5 × 5 kernel in convolution in order to obtain a 16 × 8 pixel output, we assign (t p , t a ) = ( (16, 8) , (5, 5) ) in (9) , which gives the memory size requirement of (1 + 1(16 − 1) + 1(5 − 1), 1 + 1(8 − 1) + 1(5 − 1)) = (20, 12) . Instead of loading all elements of M(A) directly from A, we first load its subtensor into useraddressable shared memory, and programmatically transform the tensor to the M(A) domain on-the-fly, thereby eliminating all memory bandwidth overhead caused by duplication, since all data are loaded from the shared memory.
B. Bank Conflict Avoidance
In GPUs, the shared memory may consist of many SRAM banks, and the number of banks is usually designed to match the number of processor cores. When the SIMD requests elements from the same SRAM at the same time, the system performance drops drastically, causing the bank-conflict problem. To avoid this problem, note that the vertical direction of the MERIT transformed matrices reflects the parallelism. This property reduces the bank-conflict problem into the analysis of column vector addresses within the subtensors of M(A). Fig. 6 illustrates examples for resolving bank-conflict by padding [40, Ch. 39] , XOR-hash [41] , [42] , and our proposed retiling technique. The example shows a vector processor with eight ALUs computing a 3 × 3 convolution for a 4 × 4 output block, requiring a 6 × 6 block in total [ Fig. 6(i) ]. The right parts of the figure illustrate some arrangements for distributing the 6 × 6 pixel block into eight SRAM banks physically. In Fig. 6 (ii-a), conflicts are highlighted in red, whereas bank conflicts in Fig. 6 (ii-b) are resolved by padding six elements after every six elements. To reduce the padding waste, Fig. 6 (ii-c) uses only two padding with XOR-hash, achieved by swapping the first and last four elements in every other row. Fig. 6 (iii) and (iv) shows two different retiling that can create conflict-free scenarios without any hashing or padding requirement. Later in Section V, we will provide more details on finding suitable configurations and for storing and transferring of SRAM data. We will also discuss our open-source implementation of the MERIT transform on GPUs in more details in Section VI.
V. HARDWARE DESIGN FOR MERIT
A. MERIT-z Vector Processor
As shown earlier, GPUs combined with the MERIT transform framework can greatly simplify the optimization process for parallel algorithms. However, GPUs also come with units that introduce unnecessary complexities in the memory hierarchy, leading to unpredictable performances. In this section, we use the insights gained from the MERIT transform and create a clean-sheet ASIC design aimed to remove these complexities. When done properly, it can outperform GPUs for memory-intensive vision processing applications while maintaining the same programming interfaces as the GPUs. Fig. 7 shows the proposed MERIT-z processor architecture. It includes a MERIT Memory Management Unit (MMU) using classic circuit blocks such as the butterfly network to map common MERIT substeps onto the processor efficiently. It also adopts a quite standard vector processor design, which comprises of several tile accumulation units (TAUs) and one Dispatcher. A TAU is analogous to streaming multiprocessors on GPUs, and is an array of N = 32 ALUs cores in our implementation. Our processor can be configured with variable numbers of TAUs, such that it can be scaled for both lowpower or high-end computing devices. A TAU consists of three coarse-grained pipelines, namely, the read pipeline RP, compute pipeline (CP), and write pipeline (WP). The RPs load tiles from the DRAM into the SRAMs, transform, and feed them into the ALUs; the WP handles data that need to be written to the DRAM. Together, RPs and WP form the MERIT MMU. Fig. 8 shows the RP which consists of multiple SRAM banks and is designed to perform substeps of the MERIT transform. The RP caches individual tensors tiles, according to Fig. 5 from the DRAM, and performs the substep transform on-the-fly when the CP is consuming the tiles. This involves reading out the data from N SRAM banks, shuffling, and feeding the data into N cores. The SRAM acts as a circular first-input-first-ouput (FIFO) as shown in Fig. 9 . Two-port (TP) SRAMs should be adequate for this purpose, or singleport (SP) SRAMs can be used in exchange for performance drop. When the processor requests a tile, we allocate its corresponding space on the SRAM and then mark it as ready after receiving the entire tile from the DRAM. Afterward, the CP can be launched after all dependent tiles are ready. Meanwhile, tiles that are not needed anymore are released due to the nature of circular FIFOs. Because partial tiles are not read out in this design, the circuits handling the read-afterwrite hazards can be greatly simplified. The total SRAM sizes in the two RPs of a TAU are 16 and 8 kB, and they can provide enough bandwidth for many computation tasks. For example, the RPs may hold the weights and feature maps in a CNN, two input matrices in GEMM, or the reference and current frames in motion estimation.
The CP accepts data from the RPs, executes the Ranged Inner-Products, and yields data to the WP. A CP contains an SIMD array and a 5-kB SP SRAM for storing partial results. The SIMD ALUs run a 32-b-ISA that comes with only seven kinds of instructions (see Table I ) for 16-b fixedpoint arithmetic operations, but can support a wide range of applications. The simplicity of this ISA is enabled by offloading the complexity of data movement and address calculation to the MERIT MMU.
The WP does not store any data. Instead, it shuffles and collects data from CP by generating addresses on-the-fly assembling the output lines to aligns with the DRAM or cache lines. Since writing data do not cause any execution dependency, a TAU can never stall at WP unless the DRAM write queue is full.
We select the total sizes of SP SRAM in the two RPs and the CP according to several empirical rules. First, Table II shows  TABLE I   SIMD ISA OF MERIT-Z PROCESSOR   TABLE II BRIEF SUMMARY OF DNN HARDWARE CONFIGURATIONS the total SRAM sizes and ALU number of several popular DNN architectures. While these architectures come with such variant computation ability and datapath, their SRAM and ALU ratio are close, and we also select a ratio in line with them. Second, the two buffers in RPs are not sharable in our architecture, and we find that the buffered tensor sizes in two RPs are mostly unequal. For example, in motion estimation or [16] , the buffered tensors of the searched frame are always larger than those of the current frame. We thus believe that using asymmetric buffer sizes provides better utilization. Third, as we discuss in Fig. 8 , the partial sum is stationary in the buffer while the buffers in RPs are also used for prefetching. This suggests that the buffer in RPs should be a few times larger than that in the CP. Based on these observations, we come up with the 16, 8, and 5 kB configuration.
Overall, a MERIT-z processor has two external interfaces, namely, a data interface which uses the valid/ready channel similar to the standard buses like AXI [43] for both address and data, and a control interface for transmitting the transform parameters d, s, o in (5) and the program shown in Fig. 4 . Unlike many accelerators that tend to consume tens of kilobytes of instruction memory to work, our processor only uses less than 0.5 kB of registers to define a computation kernel.
While most of the memory-related optimization is encoded in the MERIT transform, further optimization on MERIT-z pipeline can be done by overlapping computation and memory fetching as shown in Fig. 10 , since these two operations Fig. 10 . Overlapping computation and prefetching. The MERIT-z processor supports overlapped data movement and computation intrinsically. We can also adopt folding to eliminate wastes. use different hardware units in a TAU. To further eliminate bubbles during warm-up and cool-down stages of the pipeline, a common technique is to reduce the parallelism by forcing independent works to run on the same TAU, which is also known as folding. Instead of creating dedicated hardware units, these pipelining techniques can be implemented entirely in the software when algorithms are expressed in the MERIT transform by merely doubling the width and halve the height of (M (A), M(B) ), and add an extra for-loop level in the Ranged Inner-Products.
B. Architectural Comparison With DNN Accelerators
The MERIT-z processor is more general than dedicated DNN processors, but we can achieve comparable or better performances compared to these processors like the systolic array (TPU) or Eyeriss. We attempt to provide some insights by comparing the dataflow on an architectural level. As shown in Fig. 11 , we illustrate a 3 × 3 CNN workload example, assuming the following architectures: 1) a 16 × 8 systolic array; 2) a 12 × 14 Eyeriss divided into four 3 × 14 groups (as described in their article); and 3) MERIT-z processor with four 32-ALU TAUs. In Table III , we analyze the data-reuse rate of these architectures, which is defined by the MAC count divided by the input and output word count. Among all architectures, MERIT-z has a significantly higher reuse rate than the others. This is why the current MERIT-z implementation does not include an extra memory hierarchy (i.e., global buffer) between the TAUs and DRAM. Now we take a closer view of the dataflow of these architectures. The systolic array is an architecture for matrix multiplication, so we must use nine processing passes to compute the convolution [see Fig. 11(a) ]. Every cycle, every ALU loads two inputs, performs a MAC, and delivers one output value to its neighbor. Overall, each pass inputs an 8n matrix and outputs a 16n matrix, performing 128n MACs in total. This gives the per-ALU reuse rate 1/(1 + 1 + 1) = 0.33 and overall reuse rate 128/(16 + 8) = 5.33. Since systolic array has no local (i.e., per-ALU) storage, the local reuse rate is only 0.33.
Eyeriss is a systolic array variant with extra local storage to improve the reuse rate. Every Eyeriss ALU can hold up to 3 × 4 input feature maps, 3 × 4 × 16 kernels, and 16 partial sums, performing 3 × 4 × 16 MACs [ Fig. 11(b) ]. This almost An apparent problem of the Eyeriss dataflow is the data duplication in local storage. While their multicast connection efficiently broadcasts the kernels to a row of ALUs, it also reduces the equivalent local buffer size by order. On the other hand, while MERIT-z also uses similar input local buffer (ILB) sizes (less than 1 kB each), we aggregate them into a large one with the butterfly network. We prove in Section V-C that the butterfly network can provide full throughput for all supported workloads. The aggregated buffer sizes reach 8-16 kB, allowing MERIT-z to perform a smaller but complete CNN workload [see Fig. 11(c) ], which enables a partial CNN workload in one TAU with a reuse rate of 78.77 as calculated in Table III . This makes MERIT-z more like a no-local-reuse (NLR) architecture in Eyeriss taxonomy. NLR architectures read all data from a larger global buffer and thus achieve a higher ALUs density and data-reuse rate. However, their frequent access to the large global buffers implies high area and power overhead efficiency compared to spatial architectures. Since MERIT-z uses the butterfly network as its datapath, the overhead is rather low compared with classic NLR architectures. Figs. 12 and 13 are our synthesizing results illustrating that in a TAU with 32 ALUs, the two butterfly networks cost only 40% of 32 MAC area. Therefore, we can simultaneously benefit from the high data reuse of NLR architecture as well as the power efficiency of systoliclike architectures. The efficiency of butterfly networks is also demonstrated in accelerators like MAERI [33] as shown in Fig. 14. It uses a Chubby Tree to distribute data from the global buffer to local buffers and collects the partial sum through a configurable adder tree. The Chubby Tree in MAERI provides higher flexibility, but it still has data duplication in its local buffers. Besides, the chubby tree and the classic butterfly network are essentially topologically equivalent.
MERIT-z has several design advantages in its local storage over the other architectures, and it also has advantages due to its simplicity of memory hierarchy. First, the partial sums are written to the DRAM only when the summation tasks are finished, which satisfies the output stationary data reuse, so the output bandwidth is zero in Table III . Second, since the ALUs are organized as a SIMD that executes synchronously, the ALUs always issue vector read and write, and we use a wide SRAM to reduce the amortized partial sum buffer area. Last, since we can use the same input buffer for both prefetching data and supplying data for SIMD (see Fig. 9 ), we can remove the requirement for a dedicated FIFOs that costs 30% of the local buffer area in Eyeriss. For 28-32-nm SP SRAMs less than 1 kB, a 1.30-1.47x area means doubled storage capacity (see Fig. 13 ), and this explains why we can use a larger per-ALU ILB size (0.75 kB) than Eyeriss (0.50 kB).
C. Efficient Memory Distribution Circuit With MERIT
In Section IV-B, we mention that each ALU needs an SRAM bank in order to maximize processor efficiency and avoid bubbles. A static assignment between ALUs and SRAMs can be too inflexible, but a crossbar with (N 2 ) multiplexers would be too power-hungry and complex. In our processor, we find that the classic butterfly network works well with MERIT transform while utilizing only (N lg N) multiplexers, which represents only 30% of the area compared to a full crossbar in our synthesized circuits. In the remainder of the section, we show that, for computer vision algorithms, how a butterfly network can correctly shuffle elements from SRAM banks to ALUs without stalls using the MERIT transform.
Given the access patterns shown in Fig. 6 , we denote the SRAM address used by the nth ALU as A n , and a data element lies in A n belongs to SRAM bank (A n mod 8). In the following discussion, we show the sufficient conditions for hardware correctness and efficiency of MERIT transform implementation. Specifically, we observe that when mapping MERIT transform to SRAM and circuits, no bank conflict can occur, and a butterfly network can distribute the data to ALUs. In Fig. 6(ii-a) , the addresses accessed by the first purple subtile are A 0:7 = (0, 1, 2, 3, 6, 7, 8, 9) . Since the subtile size is the same as the ALU number, the addresses can be formulized as 
Upon careful inspection of this equation, we find that the matrix S exhibits some special attributes that can be represented by a hash property matrix H whose elements h i, j can be computed using the matrix S
where ∧ means the bitwise XOR function, and n ∧ 2 i means flipping the i th bit of n. Readers can verify that we can obtain this property matrix for (11) regardless of the base addresses
The positions of 1s in H are directly related to the power of 2 of the prime factorization of c 0:2 . In the following discussions, we define the operators on symbols (0, 1, x) as the standard operators in ternary logic. Equipped with this hash property matrix, we assert that the sufficient condition for a MERIT transform to map to a butterfly network is whether the matrix H can be reduced to the identity matrix I as follows. First, we select a row that contains only 0s and 1s. Then, we compute the NOT version of this row and apply it to another row with the AND operation. The process is repeated until it produces, or fails to produce, an I. This reduction algorithm is similar to a standard Gaussian elimination process without row swapping. For example
represents compatible address mapping for the butterfly network, while
does not, since we cannot find a row without x to start. In a more complex computation task, such as CNN with strides or dilated CNN, as shown in Fig. 6(ii-c) , H could become a nonsquare matrix. When this happens, we apply the following procedure to map a nonsquare H matrix into Listing 2. A strategy class for bilateral filter. a square one H :
where addition and multiplication are defined as XOR and AND operators in the ternary logic. Equation (16) shows a 4 × 3 property matrix H created using c 0:2 = (4, 8, 3). Two additional matrices, R and X, are introduced to convert it into a squared matrix H . X is an upper triangular matrix that trims H into a square matrix, and it has no more than one off-diagonal term per row. Matrix R is a row permutation matrix for swapping upper rows to the bottom, and it can be applied multiple times. Again, readers can verify that the resulting matrix H fulfills the aforementioned sufficient condition. Since matrices S, R, and X are all binary matrices, we can map the aforementioned math into pure data shuffle circuits and place them in the RP (see Fig. 8 ). For a 32-core TAU, the data come into the SRAM through the collector with a log 2 32 = 5-stage butterfly network, followed by a log 2 5 = 3-stage omega network that implements the aforementioned matrices (X, R). The data move from the SRAM into the CP with another log 2 32 = 5-stage butterfly network. The aforementioned discussions imply that we can use classic data distribution circuit blocks to effectively feed data into the processors to support a wide variety of workloads.
VI. EVALUATION
A. Code Size Reduction With MERIT
In this experiment, we collect fast kernels from opensource projects such as Caffe, OpenCV, and Parboil [39] , [44] , [45] , and rewrite them in order to demonstrate the code size reduction effect with the MERIT transform. Listings 2 and 3 show an example of the bilateral filter [19] implemented as an inner-product strategy class, where the Gaussian weights of spatial kernels are precomputed as lookup-tables. In Loop(), the neighbor pixels and two spatial kernels are packed in the array i. Listing 3 specifies the input as an h × w image divided into 16 × 16 blocks, and each thread with a block performs a k × k local window scan. The middle lines represent the M(·)s, and the range term σ r is passed to the kernel in the last line. Table IV lists the number of lexical tokens required to express different algorithms with MERIT, our näive C++ implementation, and fast open-source kernels for the GPUs. As shown in the table, because code regarding data movement is no longer needed in MERIT, we can greatly reduce the number of both arithmetic operators and identifier tokens even compared to the naïve CPU implementations. On the other hand, although there remains a large gap between MERIT and the heavily tuned cuDNN from NVIDIA, our implementation still surpasses Caffe, which is built on top of the heavily optimized cuBLAS linear algebra library, also from NVIDIA. Also note that we hold an edge against Caffe for smaller strides and larger kernels, showing a performance pattern in line with cuDNN.
B. MERIT Transform Performance on GPUs
C. MERIT-z Processor ASIC Implementation
In Table VI , we compare MERIT-z with several dedicated DNN accelerators [3] , [33] , and Table VII shows the area and power breakdown of our processor. We synthesize the 4-TAU MERIT-z (128 ALUs) at 400 MHz to provide a similar computation power to these works. Our architecture is similarly power-efficient (per MAC) compared to these processors, and it only uses half as much area. While comparing our frontend numbers against published backend metrics plays in our favor, it is clear that we compare favorably against these processors in terms of area and power. Since current MERIT-z implementation excludes the global buffer, it has the highest area efficiency in this table.
These gains come from several factors. First, MERIT transform allows us to use the much smaller SP SRAMs, which are only half size compared to TP SRAMs, and this results in an overall area reduction of over 30% with at most 3% utilization rate drop. Second, as discussed in Section V-B, MERIT-z can achieve a high utilization rate with a comparatively smaller SRAM storage because of the efficient MERIT transform application over the butterfly network and omega network. The simplicity of these classic circuit networks also allows us to run the chip at a much higher frequency because of their circuit routing efficiency.
To ensure evaluation accuracy, we adopt a cycle-accurate DRAM simulator called Ramulator [46] and integrate it into our simulation through Verilog VPI. We choose the 1-rank, 2-channel, DDR3 1600, 2Gbx8, and 3.2-GB/s bandwidth in the simulation settings, and the L1 cache size is set to 1 kB mainly to cover the misaligned DRAM access. Also, we synthesize the MERIT-z using the TSMC 28-nm low-power RVT library with power simulation reported using Synopsys PrimeTime.
In Table VIII , we show the performance analysis of MERIT-z running AlexNet [20] and VGG [21] . In addition, we also benchmark a DNN-based saliency detection algorithm [22] consisting of additional DECONV layers and traditional image operations. In Table IX , we further benchmark several DNN layers developed in the state-of-the-art computer vision conferences [13] , [16] - [18] , and these layers cannot be supported by any existing, single architecture because existing ones are designed for specific tasks. All of these layers are expressed with MERIT transforms, and the optimization techniques mentioned previously can be applied out-of-the-box. For example, with the help of optimization techniques shown Fig. 15 . Utilization scaling of MERIT-z processor. MERIT-z remains highly utilized among a variety of workloads up to 256 ALUs, and for comparison, we also calculate and plot the utilization rate of Eyeriss according to their processing latency. Fig. 10 , we can achieve acceptable efficiency for depth-wise convolution even though this algorithm is mostly memorybounded.
D. Scaling Up MERIT-z Processor
The vector-architecture nature of MERIT-z means that the area can be scaled up fairly linearly when adding more TAUs to MERIT-z. On the other hand, Fig. 15 shows the relative speedup with ALU counts between 32 and 1024, using the same DDR3 3.2-GB/s simulation setting as before. Most algorithms, except for the depth-wise convolution, scale up well up to 256 ALUs, beyond which the applications become DRAM memory-bounded entirely. Since MERIT-z only utilizes one level of the memory hierarchy, we eliminate the need for moving data from the global prefetch buffer to local buffers, which can be a bottleneck reported in [3] . Therefore, for DNN requiring multiple processing passes like VGG CONV1, we achieve particularly higher utilization than other works, and such feature is essential for larger DNN workloads that are likely to appear in the future.
VII. CONCLUSION
In this article, we proposed the MERIT transform, a mathematical framework for transforming vision processing tasks into SIMD-friendly workloads by separating data movement from computations. Since the core of algorithm optimization often lies in the data movement process, it allows us to write fast parallel kernels with very small lines of code. Because this process is similar across different algorithms on the same processor, we created a library for CUDA GPUs so as to free programmers from the burden of repeated optimization efforts, such as thread tiling and shared memory access. We also use these insights to design the MERIT-z processor. The processor can perform a wide range of applications, such as DNN, image processing, and traditional machine-learning applications, with the comparable area and power efficiency to several dedicated DNN processors. It uses classic circuit blocks such as butterfly networks to shuffle data to the ALUs efficiently. Also, we have released both CUDA library and MERIT-z processor implementations under the open-source GPL license.
The mathematical framework has been proved to be a useful utility for verifying the efficiency of algorithms against processors. We shall continue to identify more useful properties for the MERIT transform in order to exploit opportunities for data reuse between processors during execution. We would also like to look into more matching patterns between transforms and circuits, and provide more automatic means for generating these patterns, such that it can become more powerful and useful for the parallel and scientific computing community.
APPENDIX
The CUDA implementation of MERIT can be found in Github under mediaic/UMI, and the SystemVerilog implementation will soon be available under mediaic/MERIT.
