Abstract-Achieving optimal throughput by extracting parallelism in behavioral synthesis often exacerbates memory bottleneck issues. Data partitioning is an important technique for increasing memory bandwidth by scheduling multiple simultaneous memory accesses to different memory banks. In this paper we present a vertical memory partitioning and scheduling algorithm that can generate a valid partition scheme for arbitrary affine memory inputs. It does this by arranging non-conflicting memory accesses across the border of loop iterations. A mixed memory partitioning and scheduling algorithm is also proposed to combine the advantages of the vertical and other state-of-the-art algorithms. A set of theorems is provided as criteria for selecting a valid partitioning scheme. This is followed by an optimal and scalable memory scheduling algorithm. By utilizing the property of constant strides between memory addresses in successive loop iterations, an address translation optimization technique for an arbitrary partition factor is proposed to improve performance, area and energy efficiency. Experimental results show that on a set of real-world medical image processing kernels, the proposed mixed algorithm with address translation optimization can gain speed-up, area reduction and power savings of 15.8%, 36% and 32.4% respectively, compared to the state-of-the-art memory partitioning algorithm.
I. INTRODUCTION
With the exponentially increasing complexity in modern SoC designs, behavioral synthesis is gradually being accepted by the industry. For example, the AutoESL behavioral synthesis tool [1, 2] is now part of the Vivado Design Suite available for all Xilinx FPGA designs. By transforming untimed algorithmic descriptions into hardware implementations, behavioral synthesis can significantly reduce time-to-market and design cost with acceptable performance and power penalties. Typical applications for behavioral synthesis are data-intensive or computation-intensive kernels in signal processing and multimedia applications, where general-purpose processors often fail to meet the performance/power requirements. Such computation kernels are usually loops that manipulate multiple data elements simultaneously from arrays. Loop pipelining [3] is a common optimization technique that overlaps different loop iterations to increase performance by minimizing the initiation interval (II). While more computation units can be added to exploit loop-level parallelism for arithmetic/logic operations, supporting multiple memory accesses efficiently is key to realizing the potential performance gain made available by loop pipelining.
It would be expensive and non-scalable in terms of both cost and power to simply increase the number of memory ports [4]. Moreover, for reconfigurable platforms such as FPGAs, the number of ports of on-chip block RAM is fixed. Duplicating the target array into multiple copies can support multiple simultaneous read operations, at the price of significant area and power overhead, but it does not support simultaneous writes. A better approach is to divide the original data array into several disjoint memory banks using memory partitioning. At compile time, the behavioral synthesis tool can statically analyze the data access pattern of the target array and avoid conflicts among memory accesses by partitioning the array into different memory banks.
In parallel, memory partitioning for distributed computing has been studied for decades, where each processing unit accesses its local memory [5] [6] [7]. The ideas of some memory partitioning algorithms in distributed computing can be applied to memory partitioning in behavioral synthesis. For example, the algorithm in [8] that partitions memory into multiple banks to avoid communication between multiple tiles on a single chip is similar to the vertical partitioning algorithm proposed in this paper. However, there are also some fundamental differences between these two scenarios. The first difference is that memory partitioning in behavioral synthesis must meet cycle-level data access constraints to avoid simultaneous accesses on the same memory block. Therefore, memory partitioning and memory scheduling in behavioral synthesis should be an integrated process. The second difference is that in distributed computing, all the data elements accessed by a memory reference have to be bound to the specific local memory bank (or processing unit) to which the reference is mapped. In behavioral synthesis, multiple accesses of the same memory reference can access different memory banks in different loop iterations, which greatly expands the solution space. A third difference is that in distributed computing, data arrays are typically partitioned into a fixed number of banks determined by the hardware configuration (proportional to the number of processors), while the number of partitioned banks, or partition factor, in behavioral synthesis can be an arbitrary number determined by the data access pattern in a particular application.
The works that are most relevant to this paper are [9] and [10]. Research in [9] attempts to partition and schedule multiple memory references on a data array in the same loop iteration to multiple cyclic banks to avoid conflicts. Memory padding was introduced before memory partitioning to handle memory references with modulo operations [10]. While these works take a first step towards efficient memory support for loop pipelining in behavioral synthesis, the algorithms generate inefficient results for some inputs, as shown in the motivational examples in Section II.
In this paper a vertical memory partitioning and scheduling algorithm, or a vertical MPS for short, is developed where multiple accesses of the same memory reference in different loop iterations are scheduled to different memory banks. In contrast, approaches in [9] [10] [11] that schedule multiple memory references in the same loop iteration to non-conflicting memory banks are referred to as horizontal MPS in this paper. We show that the vertical MPS can generate valid solutions for arbitrary affine memory references 1 within a loop for any fixed memory port constraint. Furthermore, a mixed partitioning and scheduling algorithm, or a mixed MPS, that combines the advantages of both the horizontal and vertical MPS is proposed where different memory references in different iterations on an array can be scheduled simultaneously and efficiently to non-conflicting memory banks.
Traditionally, partition factors which are powers of 2 are always preferred to other factors since modulo and divide operations can be transformed into shift operations that are suitable for hardware implementation. In this paper arbitrary partition factors are supported using a novel address translation technique that considers the regularity of affine memory accesses between adjacent loop iterations, so that a larger design space can be explored for better results.
Our contributions include the following: (i) A vertical and a mixed memory partitioning and scheduling algorithm for efficiently supporting arbitrary multiple affine memory references in a loop in behavioral synthesis. (ii) An optimal and scalable memory scheduling algorithm finding the maximum matching with minimum cost on the bipartite memory scheduling graph. (iii) An optimized address translation with arbitrary partition factors which are not powers of 2.
Experimental results show that on a set of real-world medical image processing kernels, the proposed mixed MPS algorithm with address translation optimization can gain speed-up, area reduction, and power saving of 15.8%, 36% and 32.4% respectively, compared to the horizontal MPS.
1 The address of an affine memory reference is a linear combination of loop induction variables. Research in [12] shows that the majority of array references in loop kernels are affine memory references.
The remainder of the paper is organized as follows. Section II gives a motivational example for our memory partitioning and scheduling problem. Section III formulates our problem of memory partitioning and scheduling. Section IV presents proposed memory partitioning and scheduling algorithms. Section V reports experimental results and is followed by conclusions in Section VI.
II. DEFINITIONS AND A MOTIVATIONAL EXAMPLE
In this paper, we focus on partitioning and scheduling multiple memory accesses to different memory banks to support simultaneous memory accesses in loop pipelining. For simplicity, the loop stride is assumed to be 1 in this paper; the algorithms and formulations are easily extended to any constant loop stride. Assume that there are m affine memory references $R_1: a_1 i + b_1$, $R_2: a_2 i + b_2$, …, $R_m: a_m i + b_m$ on the same array in the target loop without dependency constraints among them. $R_{jk}$ is used to represent the k-th loop iteration of $R_j$, whose address is $a_j k + b_j$. Common variables in this paper are shown in Table 1.
DEFINITION 1 (MEMORY PARTITION). A memory partition is described as a function P which maps array access $R_{jk}$ to partitioned memory banks, i.e., $P(R_{jk})$ is the index of the memory bank that $R_{jk}$ belongs to after partitioning.
EXAMPLE 1. Cyclic partitioning (shown in Figure 1): $P(R_{jk}) = (a_j k + b_j) \bmod N$
In this paper, cyclic partitioning is used as the memory partitioning scheme where N is the partition factor.
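As a minimal sketch of cyclic partitioning, assuming a flat (one-dimensional) address and the usual convention that the offset inside a bank is the integer quotient of the address by N, the mapping can be written as follows. The function names `cyclic_partition` and `bank_of_access` are illustrative and do not come from the paper.

```python
def cyclic_partition(address: int, n_banks: int) -> tuple:
    """Map a flat array address to (bank index, offset within bank).

    Assumption: consecutive addresses go to consecutive banks, and the
    offset inside a bank is address // n_banks.
    """
    if n_banks <= 0:
        raise ValueError("partition factor must be positive")
    return address % n_banks, address // n_banks


def bank_of_access(a: int, b: int, k: int, n_banks: int) -> int:
    """Bank index of the access a*k + b issued in loop iteration k."""
    return cyclic_partition(a * k + b, n_banks)[0]


if __name__ == "__main__":
    # Reference 9*i + 1 with partition factor 6, first six iterations.
    for k in range(6):
        print(k, cyclic_partition(9 * k + 1, 6))
```

With a partition factor of 6, the reference 9*i+1 alternates between two banks only, which is the kind of intra-reference conflict that the later sections have to handle.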
DEFINITION 2 (MEMORY SCHEDULE). A memory schedule is described as a function T which maps array access $R_{jk}$ to its execution cycle, i.e., $T(R_{jk})$ is the cycle to which $R_{jk}$ is scheduled.
Research in [9] shows that the valid partition factor N using the horizontal MPS must satisfy (1).
Equation (1) shows that horizontal MPS fails if , or generates large partition factors if is a large prime number, as shown in Table 2 . To address this problem, vertical schedule is proposed.
DEFINITION 4 (VERTICAL SCHEDULE).
A Vertical schedule is a memory schedule with scheduling function T that satisfies:
where N is the partition factor.
EXAMPLE 3. Vertical scheduling (II=1):
( ) , c is a constant.
The difference between the horizontal and vertical MPS can be illustrated using Figure 2 . In Section IV, we will show that the vertical MPS guarantees valid solutions for arbitrary affine memory inputs, although it may generate worse results for some inputs than the horizontal MPS (shown in Table 1 ).
A mixed memory partitioning and scheduling algorithm is proposed to combine the advantages of both the horizontal and vertical MPS algorithms. Using mixed MPS, different memory references in different iterations on an array can be scheduled simultaneously to non-conflicting memory banks.
We use a real-world application, denoise [13], as an example to demonstrate the design trade-offs in the memory partitioning and scheduling problem. A simplified source code for denoise is shown in Figure 3(a). The value of an element is accumulated with all its neighbors in an 8*8*8 three-dimensional space to filter out noise. In the innermost loop, there are 7 data accesses (C, R, L, D, U, O, I for center, right, left, down, up, zout and zin) to the same array u. If the target loop is to be fully pipelined using single-port memory banks, array u has to be cyclically partitioned into multiple (at least 7) memory banks.
Using the horizontal MPS, the seven data references on array u in the same i-th loop iteration are scheduled to non-conflicting memory banks, as shown in Figure 3(c). Therefore, array u needs to be partitioned into 10 memory banks.
Scheduling results using the vertical MPS are shown in Figure 3(d). In the first cycle, accesses to u[C] in 7 successive loop iterations can be loaded simultaneously if the array is partitioned into 7 cyclic banks. The loaded values are buffered in temporary registers for future use. In the following cycles, u[R], u[L], u[D], u[U], u[O] and u[I] in the 6 successive loop iterations are also loaded into temporary registers. Accumulation of data values will start at cycle 7, and u[C] in the next 7 loop iterations will be loaded into buffers. Compared to the horizontal MPS, the vertical MPS can reduce the partition factor from 10 to 7, but it adds 6 extra cycles of latency for the whole loop and an overhead of 42 registers.
Scheduling results using the mixed MPS are shown in Figure 3(e) . In the example, [I] i are scheduled to 7 cyclic banks. Compared to the vertical MPS, a 2-cycle-latency and 25 registers can be saved using the mixed MPS. Compared to the horizontal MPS, 3 memory banks can be saved using the vertical and mixed MPS algorithms.
III. PROBLEM FORMULATION
From the motivational example, we can see that vertical and mixed schedules can potentially reduce the number of partitioned memory banks and thus the cost of the overall memory subsystem. This raises three problems: how to find valid partition factors, how to find the memory schedule with minimum cost for a given partition factor, and how to find the best partition and schedule.
DEFINITION 5 (VALID MEMORY SCHEDULE). Given a loop-based computation kernel with m affine memory references $R_1, R_2, …, R_m$ on the same array, the target throughput requirement II, the number of memory ports p, and partition factor N, a valid memory schedule is a memory schedule that satisfies both the throughput and the memory port requirements.
(2)
$B_{tl} = \{R_{jk} \mid T(R_{jk}) = t \text{ and } P(R_{jk}) = l\}$, $|B_{tl}| \le p$ (3)
where $R_{jk}$ is scheduled to $T(R_{jk})$ with loop prolog c. Equation (2) formulates the memory throughput requirement. $B_{tl}$ is the set of all the memory accesses which access memory bank l in cycle t, and (3) formulates the port number requirement.
DEFINITION 6 (VALID MEMORY SCHEDULE SET). A valid memory schedule set $S_N$ is a set of valid memory schedules.
DEFINITION 7 (VALID PARTITION FACTOR SET).
A valid partition factor set VS is a set of partition factors with valid memory schedules, i.e., $VS = \{N \mid S_N \neq \emptyset\}$.
$VS_h$, $VS_v$ and $VS_m$ are used to represent the valid partition factor sets solved by the horizontal, vertical and mixed algorithms respectively.
The memory partitioning and scheduling problem can be divided into the three problems formulated below.
PROBLEM 1 (MEMORY PARTITIONING). Given a loop-based computation kernel with m affine memory references R 1 , R 2 , …, R m on the same array, target throughput requirement II, number of memory ports p, find the valid partition factor set VS.
PROBLEM 2 (MEMORY SCHEDULING). Given a loop-based computation kernel with m affine memory references R 1 , R 2 , …, R m on the same array, target throughput requirement II, number of memory ports p, a platform-dependent cost function, and a valid partition factor N∈VS, find the memory schedule 
IV. PARTITIONING AND SCHEDULING ALGORITHMS
Algorithm 1 is the proposed memory partitioning and scheduling algorithm used to solve Problem 3. Partition factors are enumerated and evaluated from the minimum possible partition factor for m memory references. Line 9 tests whether N is a valid partition factor (Problem 1, to be solved in Section IV.A). Line 11 finds the optimal schedule for the valid partition factor N (Problem 2, to be solved in Section IV.B). Line 12 estimates the cost of a schedule (to be discussed in Section IV.C). The cost's lower bound (to be discussed in Section IV.C) is a monotonically increasing function with respect to N; thus, the exit condition can be tested at line 8 when the cost's lower bound becomes greater than the minimum cost bound. 
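The enumeration described above can be sketched as follows. This is not the paper's Algorithm 1 listing (the line numbers cited in the text refer to that listing, which is not reproduced here); it is a minimal outline under stated assumptions. The three callbacks `is_valid`, `best_schedule_cost` and `cost_lower_bound` are hypothetical stand-ins for the validity test of Section IV.A, the optimal scheduler of Section IV.B and the cost lower bound of Section IV.C, and the cap `n_max` is an added safety bound that the paper does not mention.

```python
from typing import Callable, Optional, Tuple


def search_partition_factor(
    n_min: int,
    n_max: int,
    is_valid: Callable[[int], bool],
    best_schedule_cost: Callable[[int], float],
    cost_lower_bound: Callable[[int], float],
) -> Optional[Tuple[int, float]]:
    """Return (N, cost) with the smallest schedule cost, or None.

    Relies on cost_lower_bound being monotonically increasing in N:
    once it exceeds the best cost found so far, no larger N can win.
    """
    best = None
    for n in range(n_min, n_max + 1):
        # Exit test: the lower bound already exceeds the incumbent cost.
        if best is not None and cost_lower_bound(n) > best[1]:
            break
        if not is_valid(n):          # Problem 1: is N a valid factor?
            continue
        cost = best_schedule_cost(n)  # Problem 2: optimal schedule for N
        if best is None or cost < best[1]:
            best = (n, cost)
    return best


if __name__ == "__main__":
    # Toy numbers, purely illustrative.
    toy_cost = {7: 110, 9: 95, 10: 100}
    print(search_partition_factor(
        7, 20, lambda n: n in toy_cost, lambda n: toy_cost[n], lambda n: 10 * n))
```

The early exit is what keeps the enumeration short in practice: the search does not stop at the first valid factor, because a slightly larger factor may need fewer buffer registers, but it stops as soon as the bound proves that further factors cannot be cheaper.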
A. Memory Partitioning Algorithm

1) Vertical Partitioning Algorithm
Vertical MPS schedules memory accesses of the same memory reference in successive loop iterations simultaneously to different memory banks. The constraints for the vertical partition for full pipelining (II=1) and single-port memories are:
PROOF.
Proof omitted due to page limit. Theorem 1 implies that for any memory reference patterns , because we can always find a feasible N as for the conditions above. Although other valid partition factors could be much smaller,
gives an upper bound of valid solutions. This means that arbitrary affine memory references in a loop can be fully pipelined by the vertical MPS.
Although it is easy to determine whether a given integer satisfies (5), finding an explicit expression of the minimal cyclic partition factor is not straightforward. Fortunately, in real-world applications, the coefficients in affine memory references are relatively small, so the upper bound is also a moderate number. Enumeration from m to find the minimal cyclic partition factor N is therefore not compute-intensive.
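The enumeration from m can be illustrated with a small sketch. Since the statement of Theorem 1 is not reproduced here, the validity test below is a stand-in and an assumption: it accepts N when every reference touches N distinct banks over N successive iterations, which is one direct reading of the vertical constraint for II=1 and single-port banks. The names `hits_distinct_banks` and `min_vertical_factor`, and the search cap `limit`, are illustrative.

```python
from math import gcd


def hits_distinct_banks(a: int, n_banks: int) -> bool:
    """True if a*k (mod N) takes N distinct values for k = 0..N-1.

    The constant term b of a reference only rotates the banks, so it
    does not affect this test.
    """
    banks = {(a * k) % n_banks for k in range(n_banks)}
    return len(banks) == n_banks


def min_vertical_factor(coeffs, limit=1024):
    """Smallest N >= m for which every coefficient passes the test.

    Assumption: this predicate stands in for the paper's condition (5).
    Returns None if no factor up to `limit` qualifies.
    """
    m = len(coeffs)
    for n in range(max(m, 1), limit + 1):
        if all(hits_distinct_banks(a, n) for a in coeffs):
            return n
    return None


if __name__ == "__main__":
    # Seven references with unit coefficient, as in the denoise kernel.
    print(min_vertical_factor([1] * 7))
```

For seven references with unit coefficient the sketch returns 7, which agrees with the partition factor reported for denoise in Section II. By brute force, the stand-in test is equivalent to gcd(a, N) = 1.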
2) Mixed Partitioning Algorithm
As described in the motivational example, the mixed MPS schedules memory accesses of different memory references in successive loop iterations to different memory banks in different cycles.
Considering $P(R_{j(k+N)}) = P(R_{jk})$, only memory accesses in the first N iterations are considered in memory partitioning. Memory accesses in later iterations (k ≥ N) can be partitioned and scheduled using the same pattern based on modulo scheduling.
DEFINITION 8 (CONFLICT GRAPH). Given m memory references $R_j$ (0 ≤ j < m) on the same array and cyclic partition factor N, a conflict graph G(V,E) is an undirected graph where vertex $v_{jk} \in V$ (0 ≤ j < m, 0 ≤ k < N) corresponds to memory access $R_j$ in the k-th loop iteration, and edge $(v_{j_1 k_1}, v_{j_2 k_2}) \in E$ iff $P(R_{j_1 k_1}) = P(R_{j_2 k_2})$. The conflict graph reflects pairwise conflict information between two memory accesses. Note that congruence modulo N is a transitive relation, so each connected component in a conflict graph is a clique.
DEFINITION 9 (INTRA-REFERENCE CONFLICT GRAPH). The j-th intra-reference conflict graph $G_j(V_j, E_j)$ is a subgraph of a conflict graph G where $v_{jk} \in V_j$ (0 ≤ k < N), and edge $(v_{jk_1}, v_{jk_2}) \in E_j$ iff $P(R_{jk_1}) = P(R_{jk_2})$.
DEFINITION 10 (CONFLICT SET). The conflict set $S_G(key)$ of a conflict graph G is defined as $S_G(key) = \{v_{jk} \in V \mid P(R_{jk}) = key\}$. All elements in a conflict set are connected by a clique in G.
Figure 4 shows the conflict graph of two memory references $R_1$: 9*i+1 and $R_2$: 4*i+1 with a partition factor of 6. Since each connected component in a conflict graph is a clique, only the spanning tree is shown in the figure for simplicity.
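The conflict sets of the two-reference example (9*i+1 and 4*i+1 with a partition factor of 6) can be computed with a short sketch. It assumes cyclic partitioning, so two accesses conflict exactly when their addresses are congruent modulo N. The function names are illustrative, and references are indexed from 0 here, so index 0 is the paper's R 1 and index 1 is R 2.

```python
from collections import defaultdict


def conflict_sets(refs, n_banks):
    """Group accesses (j, k) of the first N iterations by bank index.

    refs is a list of (a, b) pairs describing references a*i + b.
    The returned dict maps a bank index (the key) to its conflict set.
    """
    sets = defaultdict(list)
    for j, (a, b) in enumerate(refs):
        for k in range(n_banks):
            sets[(a * k + b) % n_banks].append((j, k))
    return dict(sets)


def conflict_edges(refs, n_banks):
    """All edges of the conflict graph: one clique per conflict set."""
    edges = set()
    for members in conflict_sets(refs, n_banks).values():
        for x in range(len(members)):
            for y in range(x + 1, len(members)):
                edges.add((members[x], members[y]))
    return edges


if __name__ == "__main__":
    for key, members in sorted(conflict_sets([(9, 1), (4, 1)], 6).items()):
        print(key, members)
```

In this example only banks 1, 3, 4 and 5 are ever accessed, and bank 1 alone collects five of the twelve accesses, which shows why a partition factor that looks large enough can still be heavily unbalanced.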
Conflict set: ; ;
;. Conflict set of each column:
where
Proof omitted due to page limit. The term in (6) represents whether | | , or whether the j-th intra-reference conflict graph has a conflict set with key k. Given input memory references, can be calculated using (7) . Therefore, (6) can be used to determine whether a given integer N is a valid partition factor. As in the vertical MPS, enumeration from m/(II*p) can be used to find valid partition factors.
B. Memory Scheduling
As formulated in Problem 2, the memory scheduling problem is to find the valid schedule with minimum cost for a given valid partition factor N. Considering $P(R_{j(k+N)}) = P(R_{jk})$, only memory accesses in the first N iterations are considered in memory scheduling. Memory accesses in later iterations can be scheduled following the pattern of the first N iterations.
A memory bank can be accessed by different array accesses in different cycles. To model this, a memory bank can be viewed as multiple virtual slots in different cycles.
DEFINITION 11 (VIRTUAL MEMORY SLOT). A virtual memory slot $S_{lgh}$ is the virtual instance of the g-th port of memory bank l at cycle h.
EXAMPLE 4. Virtual memory slot:
Suppose II=1, p=2, N=2; the memory system has 8 virtual memory slots: $S_{000}$, $S_{001}$, $S_{010}$, $S_{011}$, $S_{100}$, $S_{101}$, $S_{110}$, $S_{111}$.
With the concept of the virtual memory slot, in a valid memory schedule at most one memory access is scheduled to any virtual memory slot. The entire scheduling space can be described using a memory-scheduling graph.
DEFINITION 12 (MEMORY-SCHEDULING GRAPH). Given m memory references $R_0, R_1, …, R_{m-1}$ on the same array and cyclic partition factor N, a memory-scheduling graph SG(V, E) is an undirected bipartite graph where vertex $v_{jk}$ corresponds to memory access $R_j$ in the k-th loop iteration, vertex $S_{lgh}$ corresponds to the virtual instance of the g-th port of memory bank l in cycle h, and edge $(v_{jk}, S_{lgh}) \in E$ iff $P(R_{jk}) = l$.
An edge $(v_{jk}, S_{lgh})$ in a memory-scheduling graph means that the memory access $R_{jk}$ can be scheduled to the virtual memory slot $S_{lgh}$. An optimal memory schedule is a maximum matching on the bipartite graph SG, where each memory access is scheduled to a virtual memory slot, and each virtual memory slot serves at most one memory access. For a given N, the area of the memory and the address translation logic is fixed. Therefore, we formulate the cost of a memory schedule as the number of buffer registers needed. Suppose $v_{jk}$ is matched to $S_{lgh}$. No buffer registers are needed if h=k*II, when the k-th iteration of memory reference $R_j$ is scheduled to the l-th memory bank in cycle k*II. A read/write buffer register is needed if h≠k*II. If h>k*II and $R_j$ is a read operation, $R_{jk}$ can be scheduled to cycle h-N using modulo scheduling [14]. So the weight of an edge can be defined as:
With these definitions, the optimal memory-scheduling problem can be converted to the problem of finding the maximum matching with minimum cost on the weighted bipartite memory-scheduling graph; this can be solved by the Hungarian algorithm in polynomial time [15] .
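The matching formulation can be made concrete with a small, self-contained sketch. Several points are assumptions rather than the paper's definitions: the edge weight is simplified to 0 when h = k*II and 1 otherwise (one buffer register per displaced access), the slot cycles range over one period of N*II cycles (consistent with the 8 slots of Example 4), and missing edges are modeled by a large constant `BIG`. The Hungarian routine is a standard textbook O(n^3) formulation, and all function names are illustrative.

```python
from itertools import product

BIG = 10**6  # stands in for "no edge" between an access and a slot


def virtual_slots(n_banks, ports, ii):
    """All (bank l, port g, cycle h) slots over one period of N*II cycles."""
    return list(product(range(n_banks), range(ports), range(n_banks * ii)))


def min_cost_assignment(cost):
    """Hungarian algorithm for a rows <= cols matrix; returns column per row."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0] * (n + 1), [0] * (m + 1)
    p, way = [0] * (m + 1), [0] * (m + 1)  # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while True:  # flip the augmenting path back to the root
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
            if j0 == 0:
                break
    match = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


def schedule(refs, n_banks, ports=1, ii=1):
    """Return (register count, {access: slot}) or None if N is not valid."""
    accesses = [(j, k) for j in range(len(refs)) for k in range(n_banks)]
    slots = virtual_slots(n_banks, ports, ii)
    if len(accesses) > len(slots):
        return None
    cost = []
    for j, k in accesses:
        a, b = refs[j]
        bank = (a * k + b) % n_banks
        # Assumed weight: 0 if served in its own cycle k*II, else 1 register.
        cost.append([BIG if l != bank else (0 if h == k * ii else 1)
                     for (l, g, h) in slots])
    match = min_cost_assignment(cost)
    total = sum(cost[r][c] for r, c in enumerate(match))
    if total >= BIG:
        return None  # some access could only be placed on a missing edge
    return total, {acc: slots[c] for acc, c in zip(accesses, match)}


if __name__ == "__main__":
    print(schedule([(9, 1), (4, 1)], 6)[0])
```

For the two references 9*i+1 and 4*i+1 with N=6, the only same-cycle collision is in iteration 0, where both references address bank 1, so under the assumed weights a single access has to be displaced.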
C. Cost Optimization and Estimation
1) Address Translation Optimization (ATO)
As shown in Section II, modulo and divide operations are used in address translation for cyclic partitioning. If partition factor N is a power of 2, the modulo and divide operations can be easily done by selecting bits from the input addresses. Otherwise, they have to be implemented using non-trivial logic resources. This is why designers are usually encouraged or even restricted to use powers of 2 as partition factors, which may generate suboptimal results.
Instead of random addresses, addresses for affine memory accesses within loops are much more regular, with a constant stride between adjacent iterations. Considering this, $bank\_id_{i+1}$ and $offset_{i+1}$ (the offset within the bank) in the (i+1)-th loop iteration can be calculated using $bank\_id_i$ and $offset_i$ in the previous i-th loop iteration.
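Suppose a = k*N + l (0≤l<N), then

$$bank\_id_{i+1} = \begin{cases} bank\_id_i + l, & bank\_id_i + l < N \\ bank\_id_i + l - N, & \text{otherwise} \end{cases}$$

$$offset_{i+1} = \begin{cases} offset_i + k, & bank\_id_i + l < N \\ offset_i + k + 1, & \text{otherwise} \end{cases}$$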
$bank\_id_0$ and $offset_0$ can be calculated statically by behavioral synthesis tools at design time. At run-time, $bank\_id_{i+1}$ and $offset_{i+1}$ can be generated from the buffered $bank\_id_i$ and $offset_i$ of the previous iteration$^2$. Instead of expensive modulo and divide operations, the proposed address translation optimization (ATO) technique only uses simple operations (1 compare, 1 sub and 2 add operations) and two registers, which greatly improves performance, area and energy efficiency. With ATO, the address translation cost for arbitrary partition factors which are not powers of 2 is greatly reduced. Thus, a larger design space can be explored to obtain better results.
Figure 6 shows the block diagram of a partitioned memory system. It consists of memory banks, an address translation unit, a control FSM, possible read/write buffer registers, N input MUXs and m output MUXs.
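The incremental update can be checked against a direct modulo/divide computation with a short software model. This is a sketch of the idea, not the hardware implementation: it assumes non-negative a and b, and the function name `ato_sequence` is illustrative. The extra increment of the offset on wrap-around is written as a separate statement here; in hardware it could plausibly be folded into the offset adder as a carry-in, which is an assumption about how the operation count in the text is reached.

```python
def ato_sequence(a, b, n_banks, iterations):
    """Generate (bank_id, offset) for addresses a*i + b, i = 0, 1, ...

    Only the two divmod calls before the loop use division; they play
    the role of design-time constants. The loop body uses add, compare
    and subtract only. Assumes a >= 0 and b >= 0.
    """
    k, l = divmod(a, n_banks)              # a = k*N + l with 0 <= l < N
    offset, bank_id = divmod(b, n_banks)   # state for iteration 0
    out = []
    for _ in range(iterations):
        out.append((bank_id, offset))
        bank_id += l                       # add
        offset += k                        # add
        if bank_id >= n_banks:             # compare
            bank_id -= n_banks             # sub
            offset += 1                    # wrap-around carries into the offset
    return out


if __name__ == "__main__":
    # Partition factor 6 is not a power of 2; compare with direct divmod.
    for i, (bank, off) in enumerate(ato_sequence(9, 1, 6, 5)):
        print(i, (bank, off), ((9 * i + 1) % 6, (9 * i + 1) // 6))
```

Because 0 ≤ l < N and the running bank index is always below N, their sum is below 2N, so a single conditional subtraction is enough to bring it back into range.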
2) Overhead Estimation
The overhead of the partitioned memory system can be estimated using platform-specific cost functions, which can be area- or power-oriented. Take the FPGA platform as an example: the number of BRAMs is ⌈⌈ ⌉ ⌉. The cost of the control FSM unit is proportional to N. With the proposed address translation optimization technique, the cost of an address translation unit is proportional to the number of memory references m and independent of partition factor N. The number of buffer registers REG_N can be calculated by finding the minimum-cost matching on the bipartite memory-scheduling graph described in Section IV.B. The number of inputs to the k-th input MUX is ∑ where is defined in (7). $C_{MUX}(m)$ is the platform-dependent cost of an m-input multiplexer. The number of inputs to the j-th output MUX is $N/\gcd(a_j, N)$. Therefore, the cost of optimal memory scheduling with partition factor N is illustrated by (10) for FPGAs where are platform-dependent parameters. $Cost_{lbound}$ in (11) is monotonically increasing with N, and thus can be used in Algorithm 1 as the exit condition.
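The output MUX size quoted above follows from counting the distinct banks that one reference can reach under cyclic partitioning. The sketch below checks that count by brute force; the function names are illustrative and the loop covers one period of N iterations.

```python
from math import gcd


def banks_touched(a, b, n_banks):
    """Set of bank indices reached by a*k + b over one period of N iterations."""
    return {(a * k + b) % n_banks for k in range(n_banks)}


def output_mux_inputs(a, n_banks):
    """Number of inputs of the output MUX of a reference with coefficient a."""
    return n_banks // gcd(a, n_banks)


if __name__ == "__main__":
    for a, b in [(9, 1), (4, 1)]:
        print(a, sorted(banks_touched(a, b, 6)), output_mux_inputs(a, 6))
```

The constant term b only rotates the set of banks, so the MUX size depends on the coefficient and the partition factor alone.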
V. EXPERIMENTAL RESULTS
A. Experiment Setup
Horizontal, vertical and mixed MPS algorithms are implemented as a source-to-source transformation pass. Loop kernels in behavioral languages like C and design constraints, including the memory port limitation and target II, are taken as input. The memory partitioning and scheduling results are dumped into transformed source programs and accepted by the downstream behavioral synthesis tools.
Our test cases include a set of real-life medical image processing kernels: denoise, registration, binarization, segmentation and deconvolution [16]. All of these kernels have abundant memory accesses to the same image data array and are well suited for testing our MPS algorithms.
Although our algorithm is applicable to both ASIC and FPGA designs, we chose FPGA as the target device in this work because of the availability of downstream behavioral synthesis and implementation tools. The Xilinx Virtex-6 FPGA, AutoESL 2011.4 and ISE 13.2 tools are used in our experiments. Area utilization and critical path are reported by ISE, and power data is reported by AutoESL.
B. Case Study: Denoise
The denoise program is used as a case study to compare various approaches in memory partitioning and scheduling. Loop II and memory port number p are both set to 1 in the experiment. The test results are shown in Table 3.
Horizontal, vertical and mixed MPS algorithms are applied to the design. For each algorithm, the partition factor can be an arbitrary number or restricted to a power of 2. Address translation optimization (ATO) can be applied to partition factors which are not powers of 2.
From the results, we can see that compared to the horizontal MPS, the vertical MPS can reduce the number of block RAMs at the cost of slices and DSPs due to the complex address translation patterns. The mixed MPS is always better than both the horizontal and vertical MPS algorithms in terms of area, power and latency. The ATO techniques can be used to reduce both area and power by reducing the number of DSPs significantly. With ATO, the minimum partition factor is preferred to the partition factor with power of 2. Among all approaches, mixed-ATO shows the best performance, area-efficiency and power-efficiency. Experimental results of all other test cases are consistent with these observations.
C. Test Results
Test results on all five test cases are listed in Table 4. The horizontal MPS and mixed MPS with ATO are compared in terms of power, critical path delay, and the number of slices, block RAMs and DSPs. On average, our proposed mixed MPS with ATO can improve area efficiency by 38.9%, 36% and 99.1% in terms of slices, block RAMs and DSPs respectively, compared to the state-of-the-art horizontal MPS algorithm. The significant reduction in DSPs is mainly achieved by using ATO techniques. The mixed MPS with ATO can also improve power efficiency and performance by 32.4% and 15.8% respectively.
VI. CONCLUSION
In this paper we propose a vertical and a mixed memory partitioning and scheduling algorithm. Our algorithm can generate optimal memory partitioning and scheduling schemes for arbitrary affine memory inputs by arranging non-conflicting memory accesses across the border of loop iterations. By utilizing the property of constant strides between successive loop iterations, we propose an address translation optimization for an arbitrary partition factor to improve performance, area and energy efficiency. Experimental results show that on a set of real-world medical image processing kernels, the proposed mixed MPS algorithm with address translation optimization can gain speed-up, area reduction and power savings of 15.8%, 36% and 32.4% respectively, compared to the state-of-the-art memory partitioning and scheduling algorithm.
VII. ACKNOWLEDGEMENT
