High-level synthesis (HLS) tools have made significant progress in compiling high-level descriptions of computation into highly pipelined register-transfer level (RTL) specifications. The high-throughput computation raises a high demand on data accesses. To prevent data accesses from becoming the bottleneck, on-chip memories are used as data reuse buffers to reduce off-chip accesses. In addition, memory partitioning is explored to increase memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. Prior work on memory partitioning of data reuse buffers is limited to uniform partitioning. In this paper, we perform an early-stage exploration of non-uniform memory partitioning. We use stencil computation, a popular communication-intensive application domain, as a case study to show the potential benefits of non-uniform memory partitioning. Our novel method always achieves the minimum reuse buffer size and the minimum number of memory banks, which cannot be guaranteed by any prior work. We develop a generalized microarchitecture to decouple stencil accesses from computation, and an automated design flow to integrate our microarchitecture with the HLS-generated computation kernel into a complete accelerator.
INTRODUCTION
Accelerator-centric architectures can bring 10-100x energy efficiency improvements by offloading computation from general-purpose CPU cores to application-specific accelerators [1]. The engineering cost of designing a massive number of heterogeneous accelerators is high, but it can be greatly reduced by raising the design abstraction level beyond RTL to C with high-level synthesis (HLS) [2]. Though HLS tools are good at scheduling computation to the timing slots of clock cycles, they do not optimize data accesses. This weakness motivates recent work on data reuse [3, 4] and memory partitioning [5][6][7][8][9] in HLS.
External memory bandwidth is a dominant bottleneck for system performance and power consumption. Data reuse is an efficient technique that uses on-chip memories to reduce external memory accesses. When an application contains a data array with multiple references, we can allocate a reuse buffer and keep each array element in the buffer from its first access until its last access. Then each array element needs to be fetched from the external memory only once, and the off-chip traffic is reduced to the minimum. Loop transformation can be applied to improve data locality and reduce the size of the data reuse buffer [3, 4].
When the innermost loop of an application is fully pipelined, an accelerator needs to perform multiple load operations from the same reuse buffer every clock cycle. To avoid contention on memory ports, memory partitioning of the reuse buffer is required. Since the transistor count of memory control logic is proportional to the number of memory banks after partitioning, the optimization goal of memory partitioning is to minimize the number of memory banks. The constraint is that the multiple array elements to be loaded every clock cycle are always stored in different memory banks. The work in [5, 6] provides solid frameworks for memory partitioning. Further optimizations, including memory access rescheduling [7] and multi-dimensional arrays [8], have also been proposed. But none of them can guarantee the optimal solution, in terms of the number of banks, for a given case. The reason is that their optimization space is limited to uniform memory partitioning, i.e., all the memory banks have to be of the same size. This is an unnecessary constraint inherited from commodity HLS tools, e.g., [10]. Other work partitions different fields of a single data structure into multiple memory banks for data parallelism based on profiling results [9]. It is orthogonal to the problems on multiple array references addressed in [5][6][7][8] and in this paper.
In this paper, we go beyond the limitation of uniform memory partitioning and propose a novel method based on non-uniform memory partitioning. We found that after the removal of this limitation, we can achieve fewer memory banks than the optimal solutions in prior work [5][6][7][8]. As an early-stage exploration of non-uniform memory partitioning, in this paper we focus on stencil computation, a popular communication-intensive application domain. We develop a microarchitecture with a novel memory system structure that achieves the theoretical minimum number of memory banks for any stencil access pattern. Experimental results show that we can reduce 25-100% of various resources, including BRAMs, logic slices, and DSPs, compared to prior work [8], along with slightly improved timing.
PRELIMINARIES

Stencil Computation
Stencil computation comprises an important class of kernels in many application domains, such as image processing, constituent kernels in multigrid methods, and partial differential equation solvers. These kernels often account for most of the workload in these applications. Even in the recent publications on memory partitioning [7, 8], which were developed for general applications, all the benchmarks used are in fact stencil computations.
The data elements accessed in stencil computation lie on a large multi-dimensional grid which usually exceeds the on-chip memory capacity. The computation is iterated as a stencil window slides over the grid. In each iteration, the computation kernel accesses all the data points in the stencil window to calculate an output. Both the grid shape and the stencil window can be arbitrary, as specified by the given stencil application. A precise definition of stencil computation can be found in Appendix 9.1. Fig. 1 shows an example stencil computation in the kernel 'DENOISE' in medical imaging [11].
Figure 1: Example C code of a typical stencil computation (5-point stencil window in the kernel 'DENOISE' in medical imaging [11]).
Its grid shape is a 768 × 1024 rectangle, and its stencil window contains 5 points, as shown in Fig. 2. Five data elements need to be accessed in each iteration. In addition, many data elements will be accessed repeatedly among these iterations. For example, A[2][2] will be accessed five times, when (i, j) ∈ {(1, 2), (2, 1), (2, 2), (2, 3), (3, 2)}. This leads to high on-chip memory port contention and off-chip traffic, especially when the stencil window is large (e.g., after loop fusion of stencil applications for computation reduction as proposed in [12]). Therefore, during the hardware development of a stencil application, a large portion of the engineering effort is spent on data reuse and memory partitioning optimization.
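Since the code of Fig. 1 is only available as a figure, the following is a hedged C sketch of what such a 5-point stencil loop nest typically looks like. The grid size and the five array references follow the description above; the function name, the output array B, and the averaging expression are illustrative placeholders rather than the actual DENOISE arithmetic.

    #define ROWS 768
    #define COLS 1024

    /* 5-point stencil sweep over a 768 x 1024 grid (cf. Fig. 1 and Fig. 2).
     * Every iteration (i, j) reads the five neighbors around A[i][j]. */
    void stencil_5pt(float A[ROWS][COLS], float B[ROWS][COLS]) {
        for (int i = 1; i < ROWS - 1; i++) {
            for (int j = 1; j < COLS - 1; j++) {
                B[i][j] = 0.2f * (A[i-1][j] + A[i][j-1] + A[i][j]
                                  + A[i][j+1] + A[i+1][j]);  /* placeholder update */
            }
        }
    }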
Microarchitecture for Stencil Accesses
Figure 3: The overall architecture of our microarchitecture for stencil computation. It decouples stencil accesses from computation.
In this work we develop a microarchitecture to decouple the stencil access patterns from the computation, as shown in Fig. 3. The microarchitecture contains multiple memory systems, and each is optimized for one data array with stencil accesses. Since there are no reuse opportunities among different data arrays, the memory systems for different arrays are independent of each other. Each memory system receives a single data stream which iterates over a multi-dimensional grid without any repeated external access. Each memory system contains data reuse buffers, memory controllers and interconnects that have been customized for the access patterns of the data array used in the target stencil computation. A memory system sends data to the computation kernel via a data port associated with each array reference in the original user code. Once all the data are consumed by the computation kernel, the memory system immediately prepares the data used in the next cycle to feed the fully pipelined computation kernel. With our microarchitecture, the C code of the computation kernel can be simplified to Fig. 4, where users no longer need to optimize memory accesses, which are offloaded to our microarchitecture. Users can assume that each data port receives the same data from our accelerator microarchitecture as the original load operation would return. The C code of the computation kernel can be compiled by HLS tools into a fully pipelined hardware implementation with the most efficient resource usage.
Figure 4: Example code of the computation kernel where all the memory accesses are offloaded to our microarchitecture. The keyword 'volatile' in the code informs HLS tools of potential data change after access. The pragma 'pipeline' is used in Xilinx Vivado HLS [10] to pipeline the innermost loop.
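For concreteness, here is a hedged sketch of what the decoupled kernel described by Fig. 4 may look like after the transformation. The port names and the update expression are illustrative assumptions; the 'volatile' qualifier and the pipeline pragma follow the figure caption.

    #define ROWS 768
    #define COLS 1024

    /* Computation kernel with all memory accesses offloaded to the memory
     * system: each 'volatile' port is fed by the data filter of one array
     * reference, so every read returns the element lined up for the current
     * iteration (i, j). */
    void stencil_kernel(volatile float *a_im1_j,   /* replaces A[i-1][j] */
                        volatile float *a_i_jm1,   /* replaces A[i][j-1] */
                        volatile float *a_i_j,     /* replaces A[i][j]   */
                        volatile float *a_i_jp1,   /* replaces A[i][j+1] */
                        volatile float *a_ip1_j,   /* replaces A[i+1][j] */
                        float B[ROWS][COLS]) {
        for (int i = 1; i < ROWS - 1; i++) {
            for (int j = 1; j < COLS - 1; j++) {
    #pragma HLS pipeline II=1
                B[i][j] = 0.2f * (*a_im1_j + *a_i_jm1 + *a_i_j
                                  + *a_i_jp1 + *a_ip1_j);    /* placeholder update */
            }
        }
    }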
Targeting Optimal Design
The design target of our microarchitecture has three aspects:
1. Full pipelining. As the stencil window slides every clock cycle, the microarchitecture must send out all the data in the stencil window to the computation kernel and have all the data ready for the consecutive accesses in the next cycle.
2. Minimum data reuse buffer size. When a data element is fully reused across its accesses, it stays in on-chip memories from its first access until its last access. Meanwhile, new elements of the array are loaded from external memory every clock cycle. Therefore, the theoretical minimum size of the reuse buffer for a data array is equal to the maximum lifetime of any element in the array. In the example in Fig. 2, an element such as A[2][2] enters the buffer when it is first accessed as A[i+1][j] at iteration (1, 2) and must stay until it is accessed as A[i-1][j] at iteration (3, 2) for the last time. During this lifetime, two full rows of 1024 elements stream in. Therefore, the minimum size of the data reuse buffer for array A will be 2048 (see the code sketch after this list).
3. Minimum number of reuse buffer banks. Suppose the stencil window of an input array contains n points, i.e., there are n data references to the array. Then in each clock cycle, n data elements need to be read out, either from reuse buffers or from the external memory. Suppose we use dual-port memories to implement reuse buffers. One port of a buffer bank is occupied every cycle by the replacement of an expired data element with a new element from the external memory. There is only one port left in each memory bank for us to read the n elements needed by the stencil window. Suppose one of the n elements happens to be the new element arriving from the external memory under a certain smart data reuse mechanism. There are still n - 1 data elements to read, and we need at least n - 1 memory banks. In the example of Fig. 2, n = 5 indicates that we need at least four memory banks. As the stencil window slides, no two of the five data elements in the window should ever reside in the same bank. This is a tough constraint to satisfy, and prior work [5][6][7][8] had to use more banks to eliminate bank conflicts in difficult cases. Fig. 5 shows that the number of banks in [5] ranges from five to eight as the row size of the data grid changes, even though the stencil window keeps the constant shape in Fig. 2. The techniques proposed in [7, 8] can keep the number of banks consistently at five for the stencil window shown in Fig. 2. However, when the stencil window changes to some other shape in other applications, e.g., the ones shown in Fig. 6, the methods in [7, 8] need 5, 5, and 20 banks respectively, which are larger than the minimum values (n - 1 = 3, 3, and 18).
Figure 6: Example stencil windows where more banks are needed than the number of array references in [7, 8]. (a) 4-point stencil in 'BICUBIC' [13]. (b) 4-point stencil in 'RICIAN' [14]. (c) 19-point stencil in 'SEGMENTATION_3D' [8].
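The minimum buffer size in item 2 can be checked with a small calculation. The following is a minimal sketch, assuming a row-major input stream over the 1024-wide grid of Fig. 2 and measuring an element's lifetime as the number of stream elements between its earliest and latest accessing references; it prints 2048, matching the value derived above.

    #include <stdio.h>

    #define GRID_COLS 1024   /* row width of the data grid in Fig. 2 */

    /* Offsets (di, dj) of the five references of the 5-point stencil. */
    static const int OFF[5][2] = { {-1, 0}, {0, -1}, {0, 0}, {0, 1}, {1, 0} };

    /* Linearized position of an offset in the row-major data stream. */
    static int pos(const int off[2]) { return off[0] * GRID_COLS + off[1]; }

    int main(void) {
        int min = pos(OFF[0]), max = pos(OFF[0]);
        for (int x = 1; x < 5; x++) {
            int p = pos(OFF[x]);
            if (p < min) min = p;
            if (p > max) max = p;
        }
        /* The earliest reference A[i+1][j] and the latest reference A[i-1][j]
         * are two grid rows apart in the stream, so the maximum lifetime is 2048. */
        printf("minimum reuse buffer size = %d\n", max - min);
        return 0;
    }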
In this work we will present a generalized microarchitecture that can simultaneously achieve these optimal design targets for any application that falls in the category of stencil computation.
METHODOLOGY
Overview
The internal structure of a memory system in our microarchitecture is illustrated by the example in Fig. 7, which is generated for the stencil computation in Fig. 1. Suppose the stencil window contains n points (n = 5 in the example of Fig. 1). Our memory system will contain n - 1 data reuse FIFOs as well as n data path splitters and n data filters connected together in the way shown in Fig. 7. The data reuse FIFOs provide the same storage as conventional data reuse buffers, and the data path splitters and filters work as memory controllers and data interconnects. In contrast to conventional cyclic memory partitioning [5, 7, 8], which uses uniform buffer sizes, the sizes of the reuse buffers in our design are non-uniform. They are customized to the shape of the stencil window in the target application.
Denotations
To better explain the working principle of our memory system, Table 1 lists the denotations that will be used in the following sections. The precise definitions of the denotations are given in Appendix 9.1.
Table 1: Denotations used in this paper.
Denotation   Meaning                                    Example by Fig. 2
i            loop iteration vector                      i = (1, 2)
A            data array                                 the array A with the five references
Ax           array reference                            the five references such as A[i-1][j]
fx           data access offset of reference Ax         f of A[i-1][j] is (-1, 0)
hx           access index of Ax at iteration i          A[i-1][j] accesses h = (0, 2) at i = (1, 2)
DAx          data domain of reference Ax                the data elements accessed by A[i-1][j]
DA           input data domain of array A               the whole 768 × 1024 grid of A
n            number of references to array A            n = 5
Working Principle
Since our microarchitecture is based on non-uniform memory partitioning, in contrast to the uniform partitioning in prior work, its memory control mechanism cannot follow the modulo scheduling of data accesses among memory banks used in prior work [5, 7, 8]. Instead, our microarchitecture is a novel design based on data streaming. Each module in our design is autonomous and can work in a full pipeline as long as its upstream module produces a data element and its downstream module consumes an element every clock cycle. We will discuss the working principle of our design in terms of functional correctness, deadlock freedom, minimum reuse buffer size, and minimum number of buffer banks.
Functional Correctness
Each data path splitter in Fig. 7 reads each available data element from its preceding FIFO and sends the data element to the succeeding FIFO as well as to the data filter below it. Each data filter customizes the data stream that flows into the computation kernel to fit the access pattern of the associated array reference. A data filter for an array reference Ax receives the data stream which iterates over DA and sends out the data which iterate over DAx. For example, filter 0 in Fig. 7 selects the data elements in the set DA0 = {(i, j) | 2 ≤ i ≤ 767, 1 ≤ j ≤ 1022} out of the input data domain DA in Table 1 and discards the first two rows in the 2D grid of Fig. 2. This guarantees the correctness of the data set sent to each data port of the computation kernel. Due to the property of stencil computation, the data elements accessed by an array reference are in the same lexicographic order as the loop iteration (see Property 1 in Appendix 9.1). Our microarchitecture based on data streaming preserves this order, as long as the input data stream is also in the lexicographic order (i.e., data iterated from the innermost loop to the outermost loop). The lexicographic order of input data is usually realized without hardware overhead since it fits well with burst accesses to external memory or inter-accelerator communication patterns (see the discussion in Appendix 9.3). By providing the correct data set and the correct data order to the array references, our design guarantees that as the stencil window slides, the data elements received by the computation kernel are always consistent with the array references.
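To make the splitter/filter behavior concrete, the following is a minimal software model of one stage of the chain in Fig. 7, under the assumption that elements arrive one per call in lexicographic order. The function names and callback interface are illustrative; the domain test is the one quoted above for filter 0.

    #include <stdbool.h>

    /* Membership test for DA_0 = {(i, j) | 2 <= i <= 767, 1 <= j <= 1022}. */
    static bool in_DA0(int i, int j) {
        return 2 <= i && i <= 767 && 1 <= j && j <= 1022;
    }

    /* One stage of the chain: the data path splitter forwards every element of
     * the input stream both to the next reuse FIFO and to its data filter; the
     * filter passes on only the elements inside DA_0 and discards the rest. */
    void splitter_and_filter0(int i, int j, float value,
                              void (*push_next_fifo)(float),
                              void (*send_to_kernel)(float)) {
        push_next_fifo(value);          /* splitter: keep the stream flowing downstream */
        if (in_DA0(i, j))
            send_to_kernel(value);      /* filter: select DA_0 out of DA */
    }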
Deadlock Freedom
Since the modules in our design are autonomous instead of being synchronized by a centralized controller, a critical design challenge is to ensure that these modules are free of deadlock. Deadlock is caused by any potential cycle in the dependency graph of all the modules.
Figure 7: The example circuit structure of our memory system generated for array A in the stencil computation of Fig. 1.
Due to the chaining structure shown in Fig. 7, any potential cycle must go through a pair of data filters.
Figure 8: The dependency graph of data filters 'filter_x' and 'filter_y' where x < y, shown in the form of two dependency cycles.
For a pair of data filters 'filter_x' and 'filter_y' where x < y, we can draw all the dependency relations as shown in Fig. 8. There are four possible dependency relations, each marked with the condition under which the dependency happens. Obviously, the conditions of edges e1 and e2 are mutually exclusive, and the same applies to e3 and e4. Therefore, only the pair (e1, e3) and the pair (e2, e4) could form two potential dependency cycles, as shown in Fig. 8. We can prove that e1 will be mutually exclusive with e3, and e2 will be mutually exclusive with e4, as long as we satisfy the following conditions:
1. For the data access offsets fx and fy of two array references Ax and Ay where x < y, fx is lexicographically greater than or equal to fy.
2. For the reuse FIFO between adjacent references Ax and Ay, the FIFO size is greater than or equal to the maximum reuse distance between Ax and Ay.
Both conditions have very clear physical meanings. The first condition means that a data element accessed by an early array reference can be reused by a late array reference, but not the reverse. The second condition means that if an array reference reuses data elements accessed by another array reference, the reuse buffer size should be greater than or equal to the maximum reuse distance between these two references. The detailed proof of these two deadlock-free conditions can be found in Appendix 9.2.
To ensure the first condition, we sort the offsets f of the array references in descending lexicographic order when we map them to data filters 0 to n - 1, e.g., as in Fig. 7. To ensure the second condition, we calculate the maximum reuse distances of all pairs of adjacent array references and allocate the reuse FIFO sizes accordingly, as shown in Table 2 for the example in Fig. 1.
Table 2: Reuse FIFO sizes and physical implementations for the example in Fig. 1.
FIFO ID   precedent/successive references   FIFO size   physical impl.
FIFO 0    A[i+1][j] / A[i][j+1]             1023        block RAM
FIFO 1    A[i][j+1] / A[i][j]               1           registers
FIFO 2    A[i][j] / A[i][j-1]               1           registers
FIFO 3    A[i][j-1] / A[i-1][j]             1023        block RAM
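The FIFO depths in Table 2 follow directly from the sorted offsets. A minimal sketch of that calculation is shown below, assuming reuse distance is measured as the linearized offset difference on the 1024-wide grid; it prints the depths 1023, 1, 1, 1023 and the total of 2048 discussed in Section 2.3.

    #include <stdio.h>

    #define GRID_COLS 1024

    /* Offsets of the five references, already sorted in descending lexicographic
     * order: A[i+1][j], A[i][j+1], A[i][j], A[i][j-1], A[i-1][j]. */
    static const int OFF[5][2] = { {1, 0}, {0, 1}, {0, 0}, {0, -1}, {-1, 0} };

    int main(void) {
        int total = 0;
        for (int x = 0; x < 4; x++) {
            /* FIFO x sits between adjacent references A_x and A_{x+1}; its depth
             * must cover the maximum reuse distance between the two. */
            int depth = (OFF[x][0] - OFF[x + 1][0]) * GRID_COLS
                      + (OFF[x][1] - OFF[x + 1][1]);
            total += depth;
            printf("FIFO %d depth = %d\n", x, depth);
        }
        printf("total reuse buffer size = %d\n", total);   /* prints 2048 */
        return 0;
    }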
Design Optimality
Minimum Reuse Buffer Size. Due to the linearity of maximum reuse distances, the sum of the sizes of all the reuse FIFOs is equal to the maximum reuse distance between array references A0 and An-1. Due to the sorting of array references by f in descending lexicographic order, A0 is the earliest reference and An-1 is the latest reference. Therefore, the total reuse buffer size is equal to the maximum reuse distance between the earliest and latest references, which is the theoretical minimum. As shown in Table 2 for the example in Fig. 1, the total size is 2048, the same as the minimum value discussed in Section 2.3. If the maximum reuse distance is so large that the buffer sizes exceed the on-chip memory capacity, our microarchitecture allows trading off-chip bandwidth for smaller memory usage (see the discussion in Appendix 9.4).
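For a rectangular 2D grid of width W, the linearity argument can be written as a telescoping sum. This is a sketch under the assumption that the maximum reuse distance between two references equals the difference of their linearized offsets, with the linearization defined as $\ell(f) = f_i \cdot W + f_j$:

    \sum_{x=0}^{n-2} \mathrm{dist}(A_x, A_{x+1})
      = \sum_{x=0}^{n-2} \bigl( \ell(f_x) - \ell(f_{x+1}) \bigr)
      = \ell(f_0) - \ell(f_{n-1})
      = \mathrm{dist}(A_0, A_{n-1})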
Minimum Number of Buffer Banks. The structure of our design guarantees that for n array references, there are n − 1 buffer banks (reuse FIFOs) as described in Section 3.1. It is the theoretical minimum value as discussed in Section 2.3.
Since our design achieves both the minimum reuse buffer size and the minimum number of buffer banks, our microarchitecture is optimal.
Insights Gained From RTL Simulation
Our microarchitecture is quite different from conventional designs with centralized controllers [5][6][7][8]. The major tasks of a conventional controller include two aspects:
1. Filling up reuse buffers. Before the computation starts, the controller will first fill reuse buffers with data elements needed by the computation kernel.
2. Evicting expired data from reuse buffers. The challenging part of this task arises when the reuse distance between array references changes as the execution advances. This often happens on a skewed data grid. In this case, the number of data elements stored in the reuse buffers changes over time, and the symmetry between reads and writes is broken.
There is no specific module that takes charge of these key tasks in our microarchitecture. Instead, by observing the execution of our design in RTL simulation, we found that these key tasks are done automatically by the coordination of our distributed modules.
Automatic Filling of Reuse Buffers
The filling process of the reuse buffers in our microarchitecture is shown in Table 3. The data filter 4, associated with reference A[i-1][j], is the first to stall, at cycle 1. This is when filter 4 tries to send data A[0][1] to the computation kernel while all the other data filters are discarding this data. As a result, the computation kernel waits for data from the other data filters and does not consume the data from filter 4. This stall causes data to fill up in FIFO 3, between filters 3 and 4. The other four filters keep their data streams advancing since all of them discard the first row of the data domain. 1023 cycles later, filter 3 tries to send data A[1][0] to the computation kernel and stalls as well. Then FIFO 2 starts being filled up, and the data path splitter s3 stops sending data to FIFO 3. The following process is similar until FIFO 1 and FIFO 0 are filled up consecutively, as shown in Table 3. Then at cycle 2049, filter 0 receives A[2][1] and sends the data to the computation kernel. All the data at the five inputs of the computation kernel become valid, and the kernel consumes all five data elements to produce the first output. From then on, all the stalled filters can continue to send new data to the computation kernel every clock cycle until the end of the iteration domain.
Automatic Adjustment of Reuse Data Amount
An application can have a skewed data grid as shown in Fig. 9. This kind of application usually arises when a rectangular grid is iterated along the 45° direction after certain loop transformations [15]. The skewed grid leads to the challenge that the number of data elements stored in each reuse buffer changes as the iteration goes on, and it often requires a complex memory control scheme in a centralized design [5, 7, 8]. However, this difficulty is automatically handled by our distributed modules. Following the design scheme in Section 3.1, we order the array references and map them to the five filters 0-4. Among them, filter 2 is for reference A[i][j], and filter 3 is for the adjacent reference that follows it in the sorted order. Note that the access index h2 of filter 2 advances the index h3 of filter 3 by one row when h2 and h3 are synchronized by the computation kernel, as shown in Fig. 9. At each turnaround to the next row of the data domain, filter 3 fetches one more data element from the splitter than filter 2, since filter 3 is iterating over a longer row. Then the number of data elements stored in FIFO 2, between filter 2 and filter 3, is reduced by one. This achieves the dynamic adaptation of the number of data elements stored in a reuse buffer to the change of reuse distance in the case of a skewed data grid.
Miscellaneous Design Issues
Heterogeneous Mapping of Reuse Buffers
Reuse buffers in our design have different sizes, as shown in Table 2, and may prefer different physical implementations. For example, if the target platform is an FPGA, the candidate physical implementations include block memory, distributed memory, and slice registers. They are efficient for large, medium, and small buffers, respectively. Table 2 shows the heterogeneous mapping of reuse buffers to different physical implementations.
Data Filter in Polyhedral Domain
Note that though the data domains are rectangles in the example of Fig. 1, they could be arbitrary polyhedra on a multi-dimensional grid. Comparing loop bounds is therefore not a universal solution to data filtering. To select DAx out of DA, the data filter in our microarchitecture is implemented with a data switch controlled by two counters, an input counter and an output counter. The input counter proceeds when the filter receives an input data element. The output counter proceeds when its value is equal to the input counter; this is also the condition under which the data switch forwards the input data to the output. In contrast, when the output counter is not equal to the input counter, the data switch discards the input data.
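A minimal software model of this counter mechanism is sketched below, using the bounds of filter 0 from Section 3.3.1 as an example. In hardware the two counters would step through the polyhedra DA and DA_0 directly, so the row-wrap logic here is only illustrative. Running the model forwards 766 × 1022 elements, the size of DA_0.

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int i, j; } counter_t;

    /* Input counter: iterates the input data domain DA (the full 768 x 1024
     * grid) in lexicographic order, one step per received element. */
    static void step_DA(counter_t *c) {
        if (++c->j == 1024) { c->j = 0; c->i++; }
    }

    /* Output counter: iterates DA_0 = {2 <= i <= 767, 1 <= j <= 1022}. */
    static void step_DA0(counter_t *c) {
        if (++c->j == 1023) { c->j = 1; c->i++; }
    }

    int main(void) {
        counter_t in = {0, 0}, out = {2, 1};   /* out starts at the first element of DA_0 */
        long forwarded = 0;
        for (long n = 0; n < 768L * 1024L; n++) {
            bool match = (in.i == out.i && in.j == out.j);
            if (match) {            /* data switch forwards the element ...   */
                forwarded++;
                step_DA0(&out);     /* ... and the output counter proceeds    */
            }                       /* otherwise the element is discarded     */
            step_DA(&in);           /* the input counter proceeds per input   */
        }
        printf("forwarded %ld of %ld elements\n", forwarded, 768L * 1024L);
        return 0;
    }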
DESIGN AUTOMATION FLOW
We develop a design automation flow to generate the complete accelerator for a given stencil application, as shown in Fig. 11. It starts from the original source code of a user application, e.g., the code in Fig. 1. In the left branch, we first apply polyhedral analysis to extract the polyhedra of the data arrays with stencil accesses. We calculate the data domain of each array reference and the reuse distance of each pair of adjacent array references. This information is used to instantiate the data filters and reuse FIFOs in our microarchitecture. Then the flow generates a microarchitecture instance, e.g., the design in Fig. 7, customized to the stencil accesses in the user application. In the right branch, we first apply source-to-source code transformation to extract the kernel code with pure computation, e.g., Fig. 4. Then high-level synthesis is applied to the transformed code for a fully pipelined hardware implementation of the computation kernel in RTL. Finally, we integrate the microarchitecture with the computation kernel into a complete accelerator with full pipelining and data reuse, e.g., the design in Fig. 3.
Table 3: The execution flow of our microarchitecture in the example of Fig. 1. The latency among the data streams at different modules is ignored here for demonstration purposes only.
EXPERIMENTS
Experiment Setup
Our polyhedral analysis in Fig. 11 is implemented with the LLVM Polly framework [16]. The kernel transformation is performed by the open-source compiler infrastructure ROSE [17], and the high-level synthesis is performed by Xilinx Vivado HLS [10]. Although our methodology is applicable to both ASIC and FPGA designs, we choose FPGA as the target device in this work due to the availability of downstream behavioral synthesis and implementation tools. The Xilinx Virtex-7 FPGA XC7VX485T and the ISE 14.2 tool suite [18] are used in our experiments. The target clock frequency is set to 200 MHz.
The benchmarks used in prior memory partitioning work [7, 8] make up a rich set of real-life stencil computation kernels. Among them, we select the more challenging benchmarks with non-rectangular stencil windows for our experiments. DENOISE (2D/3D), RICIAN (2D), and SEGMENTATION (3D) are from medical imaging [11]. BICUBIC (2D) is from the bicubic interpolation process [13]. SOBEL (2D) is from the Sobel edge detection algorithm [14]. We choose the more recent memory partitioning work [8] as our experimental baseline.
Results
Table 4: High-level partitioning results.
The comparison of memory partitioning results is shown in Table 4. We list the pipeline II of the original user code, which suffers memory port contention before memory partitioning; it is equal to the number of memory load operations on the data array. We also list the II that the computation kernel targets to achieve via memory partitioning. The number and total size of reuse buffer banks are reported for both [8] and our method. The buffer size is in units of data elements. As shown in Table 4, our method reduces the number of partitioned banks for all six benchmarks. In addition, our method does not need the padding technique in [8], which increases the grid size in certain dimensions to relax the partitioning complexity. Our method also saves buffer size, especially when the padding introduces more overhead in a high-dimensional data grid, e.g., Fig. 6(c). The post-synthesis results are listed in Table 5. Physical resource usage (block RAMs, logic slices, and DSPs) and timing information are extracted from the Xilinx ISE reports. As shown in Table 5, we use 66% fewer block RAMs than [8]. This stems from 1) the minimum number of buffer banks achieved, and 2) the heterogeneous mapping of buffer banks to various resources in addition to block RAMs, as demonstrated in Table 2. We also use 25% fewer logic slices than [8], even though we implement some of the small reuse buffers in registers. That is because we avoid the modulo scheduling in conventional uniform memory partitioning, which requires a hardware transformer to map the original data address to the bank ID and local address via a complex calculation involving multiplication and division. Instead, our memory system only needs counters iterating over the data domains in the lexicographic order. This advantage is also reflected by the complete elimination of DSPs in our method. The clock period does not differ much between [8] and our method, since the back-end flow stops optimizing once it meets the 200MHz target. However, our method generally has larger slacks from the 5.0ns target, as shown in Table 5. This is mainly due to the distributed structure of our method. We tried to use the Xilinx XPower Analyzer for power estimation, but found that the FPGA power is dominated by static power and is almost invariant across custom circuits. If power gating were available in the FPGA, the FPGA power would be proportional to resource usage, which is covered by Table 5.
CONCLUSIONS AND FUTURE WORK
In this work, we propose non-uniform memory partitioning, which opens a new design space compared to the conventional cyclic partitioning framework. As a starting point, we use stencil computation as the initial design target and show a novel memory system that works with non-uniform reuse buffer sizes. The memory system in the extended design space can achieve the optimal solution with the minimum reuse buffer size and the minimum number of buffer banks. We develop a design automation flow that generates a microarchitecture with our memory systems and integrates it with the computation kernel for a complete design. Experimental results show that our method outperforms the recent memory partitioning work in terms of the utilization of various FPGA resources.
As this is the first work on non-uniform memory partitioning, our primary goal is to show the potential of this new approach. Though stencil computation is a popular application domain and has attracted the attention of most memory partitioning work, it is still important to extend non-uniform memory partitioning to general cases. Our data streaming method may not be the only way to utilize non-uniform reuse buffers. A modified modulo scheduling extended from conventional uniform memory partitioning is also a good candidate. We believe that there are many opportunities for future research.
SUPPLEMENTARY MATERIALS
Polyhedral Model
We support general stencil accesses where neither the data grid nor the stencil window needs to be rectangular. This is often the case when stencil computation undergoes loop transformations based on the polyhedral model [3, 4, 15] before memory access optimization. For generality, our work is also built on top of the polyhedral model, as described below. Several properties of stencil computation are identified under the framework of the polyhedral model as well.
DEFINITION 1 (ITERATION DOMAIN). For a loop nest of depth m, the iteration domain D ⊆ Z^m is the set of iteration vectors i for which the loop body is executed, expressed by a set of linear inequalities on i.
This definition of the iteration domain can describe any polyhedral shape on a multi-dimensional grid, e.g., rectangular, triangular, or diamond.
DEFINITION 2 (LEXICOGRAPHIC ORDER [19]). The lexicographic order relation ≺l of two iteration vectors i and j is defined as: i ≺l j if and only if there exists a dimension k such that ik < jk and it = jt for all t < k.
DEFINITION 3 (ACCESS FUNCTION [19]). For a k-dimensional array reference, its access function H : Z^m → Z^k is the mapping from the iteration vector i to the access index h, i.e., h = H(i).
DEFINITION 4 (STENCIL COMPUTATION).
Under the framework of the polyhedral model, stencil computation is defined as: the access function of each array reference Ax satisfies
Hx(i) = I · i + fx,
where I is the identity matrix and fx is a constant offset vector. It means that for any reference Ax of a data array A, the accessed index is hx = i + fx, i.e., the iteration vector shifted by the constant offset fx.
DEFINITION 5 (DATA DOMAIN). For an array reference Ax, the data domain DAx is the set of data elements accessed by Ax, and is expressed by a set of linear inequalities:
DAx = { i + fx | i ∈ D }.
Following the definition of the access function in Definition 4, we have the following property for stencil computation.
PROPERTY 1 (LEXICOGRAPHIC ACCESS PATTERN). Data elements in DAx are accessed by the array reference Ax in the lexicographic order.
DEFINITION 6 (INPUT DATA DOMAIN). For a data array A with n references A0, A1, ..., An-1, the input data domain of A is the union of the data elements that need to be fetched from external memory, defined as DA = DA0 ∪ DA1 ∪ ... ∪ DAn-1.
Figure 13: Integration of an accelerator with our microarchitecture in a system. (c) Our accelerator with a single data reference can be easily co-optimized with another accelerator by loop transformation for data forwarding.
Note that our methodology transforms the original accelerator with multiple data references, e.g., Fig. 13a, into a new accelerator with a single reference, e.g., Fig. 13b. An accelerator with our microarchitecture can be easily connected to an off-chip prefetch module with bus burst accesses. The prefetch module can directly forward the data stream from the bus pipeline to the accelerator and only needs a small buffer to hide the bus latency.
In addition, our microarchitecture can simplify the co-optimization of accelerators for inter-block communication. In the example of Fig. 13c, accelerator 1 stores its output array element by element in the order of j and i. The load operations of accelerator 2 are very different from the store operations of accelerator 1. To implement the inter-block communication, accelerator 1 has to write the 1022×766 output data block to an on-chip block memory first. After that, accelerator 2 reads data from the block memory. With our microarchitecture, however, the input data stream of accelerator 2 is transformed to A[i][j], which has the same form as the output of accelerator 1. This allows us to apply loop transformation so that the data orders at the output of accelerator 1 and the input of accelerator 2 are the same, e.g., enabling the technique in [15]. Then data can be directly forwarded from accelerator 1 to accelerator 2 without a large on-chip block memory.
In summary, our microarchitecture facilitates system-level synthesis in terms of optimization of module-level communication and off-chip communication.
Enabling Bandwidth/Memory Tradeoff
There is a tradeoff between off-chip bandwidth and on-chip memory usage. When we have larger off-chip bandwidth, we can use smaller on-chip memory. In the extreme case, when we have sufficiently large off-chip bandwidth, we do not need any on-chip memory for data reuse. The different pairs of off-chip bandwidth occupation and on-chip memory usage form the points on a design curve. When we move along the curve, the array references that go through reuse buffers change. The problem is that conventional memory partitioning does not guarantee an optimal solution for an arbitrary access pattern, so the optimality of memory partitioning may be lost when we perform the bandwidth/memory tradeoff. One of the benefits of our microarchitecture is that it automatically reduces the buffer size to the minimum when the off-chip bandwidth is increased, while keeping the optimal design structure. As shown in Fig. 14, when we allow one more off-chip access per clock cycle, we pick the largest reuse buffer in our microarchitecture in Fig. 14(a) and replace it with an input data stream from off-chip in Fig. 14(b). By doing so, we achieve a graceful degradation of on-chip memory usage with the increase of off-chip bandwidth, as shown in Fig. 15. We use the 19-point stencil in SEGMENTATION in Fig. 6(c) as an example and sweep the number of off-chip accesses per cycle from 1 to 18. The three phases in Fig. 15 show that we first give up inter-plane data reuse, which needs large buffers, then inter-row data reuse, which needs medium buffers, and finally intra-row data reuse, which needs small buffers.
Difference from Memory Partitioning for Distributed Computing
This work mainly targets memory partitioning for behavioral synthesis. In parallel, memory partitioning for distributed computing has been studied for decades [20]. However, there are fundamental differences between these two scenarios. The first difference is that memory partitioning in behavioral synthesis must meet cycle-level data access constraints to avoid simultaneous accesses to the same memory block. Therefore, memory partitioning and memory scheduling in behavioral synthesis should be an integrated process. The second difference is that in distributed computing, all the data elements accessed by a memory reference have to be bound to the specific local memory bank (or processing unit) to which the reference is mapped. In behavioral synthesis, multiple accesses of the same memory reference can access different memory banks in different loop iterations, which greatly expands the solution space. A third difference is that in distributed computing, data arrays are typically partitioned into a fixed number of banks determined by the hardware configuration (proportional to the number of processors), while the number of partitioned banks in behavioral synthesis can be an arbitrary number determined by the data access pattern of a particular application.
