Uniform nested loops are widely used in scientific and multi-dimensional digital signal processing applications. Because of the amount of data handled by such applications, on-chip memory is required to improve data access and overall system performance. In this study, a static data scheduling method, carrot-hole data scheduling, is proposed for multi-dimensional applications in order to control the data traffic between different levels of memory. Based on this data schedule, optimal partitioning and scheduling are selected. Experiments show that this technique significantly reduces on-chip memory misses compared with traditional methods. The carrot-hole data scheduling method is proven to incur the fewest on-chip memory misses compared with other linear scheduling and partitioning schemes.
Introduction
In general, scientific computing and digital signal processing applications, such as image and video processing, require high-speed processing power. General-purpose processors may not be able to meet the requirements on throughput, area or power consumption because of the overhead needed to provide generality. Application-specific processors, however, are appropriate candidates to meet the requirements of these applications. Applications that process multi-dimensional signals need to store a large amount of intermediate data. The data access time may therefore dominate the total execution time, and reducing the average data access time becomes crucial for system performance.
One common technique for optimizing data access time is the use of a memory hierarchy, so that most of the data is accessed at the fastest memory level. Advances in VLSI technology allow both powerful computation ability and fast memory on a single chip. On-chip memory is usually much faster to access than off-chip memory. However, due to chip size constraints, on-chip memory may not be large enough to hold all the data needed throughout the computation. To increase speed, the rule of thumb is to schedule the data carefully across the memory levels so as to minimize the number of on-chip memory misses. The same techniques that maximize on-chip memory utilization can also be applied to reduce the memory size needed for a given time requirement.
Scheduling usually refers to the procedure of mapping computations into different control steps [8]. This is referred to as execution scheduling in this study. In this paper we propose a new optimization stage called data scheduling, which indicates when and where to allocate and transfer data in a memory hierarchy. For a general-purpose application, finding the optimal data schedule for a hierarchical memory is known to be difficult. However, since the most time-critical sections in DSP and scientific calculations usually consist of nested loops in which loop-carried dependencies are uniform, finding an optimal application-specific data schedule is possible.
In this paper we present an application-specific data scheduling technique that dramatically reduces the total number of on-chip memory misses compared with commonly used methods such as LRU (Least Recently Used) or FIFO (First In First Out). We show that data scheduling and execution scheduling must be considered at the same time in order to achieve the best result. Since on-chip memory has limited size, partitioning the whole problem space into subproblems is also essential. Our integrated technique is used to minimize off-chip memory accesses for a limited-size on-chip memory, given a single processor and a large problem space.
Initial research on the partitioning problem focused on distributing tasks among groups of processors or mapping algorithms onto systolic or wavefront array processors, without considering the memory access overhead [3, 9, 10] or on-chip memory size constraints [11]. For general-purpose multi-processor systems, the partitioning and scheduling algorithms need to consider the communication cost among processors. For example, Agarwal et al. present a theoretical framework to derive the shapes of the iteration-space partitions of do loops that minimize the traffic in multi-processors with local memory [2]. However, they assume that iterations can be executed in parallel and that the local memory of each processor is large enough for that processor's share of the computation.
The pattern of references among arrays in nested loops is analyzed in [5], where duplicate or non-duplicate approaches are used to distribute the data across multiple processors to reduce or eliminate interprocessor communication. However, this technique can only be applied to a subset of uniform loops. The work by Passos and Sha [12] uses a multi-dimensional retiming technique to obtain full parallelism without considering the memory access overhead. These previous studies provide guidelines on partitioning and scheduling for the problem we are considering. However, instead of achieving parallelism, the objective of our work is to generate an optimal data and execution schedule, as well as an efficient partition, to minimize secondary memory accesses for the whole iteration space.
Relevant work on memory estimation and management for application-specific ICs is discussed in [14, 15, 16]. In [14], the number of memory locations for array structures described in the SILAGE language is estimated. In [15, 16], a methodology to investigate the background memory requirements of algebraic algorithms is proposed. However, they do not consider the memory hierarchy beyond registers.
In this paper, we provide an efficient way to decide the partitioning and data scheduling scheme for problems with a regular dependence pattern carried out by a single processor. The objective is to minimize on-chip memory misses subject to the on-chip storage size constraint. For example, a typical loop with loop-carried dependencies is shown below:
for j = 1 to 1000
  for i = 1 to 1000
  end for
end for
In this case, suppose that on-chip memory is large enough to hold 500 data items at the same time. By using our technique, the total number of on-chip memory misses is about 6000 including 4000 unavoidable misses associated with the input data. However, if FIFO or LRU cache replacement strategies were used for the same problem, the on-chip memory misses would be near 1,000,000.
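To make the contrast concrete, the following toy simulation counts LRU misses for a loop nest with a single, hypothetical dependence vector (0, 1): node (i, j) consumes the item produced at (i, j - 1). The loop, the sizes and the helper names (`lru_misses`, `row_major`, `strips`) are illustrative assumptions, and the strip-wise order is only a stand-in for partitioned execution, not the carrot-hole scheme itself.

```python
from collections import OrderedDict


def lru_misses(order, capacity):
    """Count LRU misses on consumed items for the assumed dependence (0, 1)."""
    cache = OrderedDict()
    misses = 0

    def touch(item):
        # Returns True on a hit; on a miss the item is loaded and the
        # least recently used entry is evicted if the memory is full.
        if item in cache:
            cache.move_to_end(item)
            return True
        cache[item] = None
        if len(cache) > capacity:
            cache.popitem(last=False)
        return False

    for i, j in order:
        if j > 0 and not touch((i, j - 1)):  # consume the item from row j - 1
            misses += 1
        touch((i, j))                        # produce this node's item
    return misses


def row_major(n, m):
    """Execute the whole n-by-m iteration space row by row."""
    return [(i, j) for j in range(m) for i in range(n)]


def strips(n, m, b):
    """Execute vertical strips of width b one after another, row by row."""
    return [(i, j)
            for start in range(0, n, b)
            for j in range(m)
            for i in range(start, min(start + b, n))]


if __name__ == "__main__":
    print("row-major:", lru_misses(row_major(16, 10), capacity=8))
    print("strips   :", lru_misses(strips(16, 10, 4), capacity=8))
```

With a row of 16 nodes and an 8-entry memory, row-major order evicts every item before it is reused, whereas strips of width 4 keep each item resident until its single consumer runs.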
The two key procedures used in our technique are called carrot-hole data scheduling and working set partitioning schemes. They are the basis of the memory traffic minimization method proposed in this paper, which is organized as follows: the next section shows the basic concepts and problem description. A new scheme of data scheduling is introduced in Section 3. Section 4 explores the relations among scheduling, partitioning and the cache size. Section 5 discusses how to minimize the total on-chip memory misses. Finally, some experimental results are shown in Section 6.
Background
The target problems considered in this paper belong to a class of algorithms whose computations are identical over the entire index space. Because of this uniformity, a multi-dimensional data flow graph can be used to represent the loop body, and dependence vectors can be used to represent the loop-carried dependencies [4, 12]. Our discussion focuses on uniform nested loops that can be represented in a 2-dimensional index space. Below we introduce some important definitions of our model.
In the iteration space $T$, $n$ indicates the number of dimensions. Each index node represents one execution of all the computation nodes in $G$. The edges of $T$ correspond to the nonzero-delay edges in $G$ (i.e., loop-carried dependences), and $w(e)$, for an edge $e$ of $T$, is the data reference transmitted by $e$. Figure 1 shows an example of the iteration space $T$. Note that throughout the paper, "nodes" and "edges" refer to those in $T$ unless otherwise indicated. We say that a data item is the result generated by a certain computation in one iteration of the iteration space $T$. Two edges $e_1$ and $e_2$ of $T$ are said to be equivalent if $w(e_1)$ refers to the same data item as $w(e_2)$.
Figure 1: Example of multi-dimensional data flow graph and iteration space.
Edges from different nodes in $T$ cannot be equivalent because they reference data items produced in different iterations. The concept of a dependence equivalent class is introduced to indicate that the same data item is referenced by different nodes in $T$.
Definition 2.3 For a given iteration space $T$ over the index set $I^2$ with data-reference function $w$, and an edge $e$ of $T$, the dependence equivalent class $Q$ with respect to $e$ is the set of all dependence edges $e_i$ of $T$ that are equivalent to $e$.
The equivalent class defined in $T$ can easily be generalized to an equivalent class in a multi-dimensional data flow graph $G$, which is called a vector equivalent class.
Definition 2.4 For a given iteration space $T$ and multi-dimensional data flow graph $G = (V, E, d)$, a vector equivalent class is the set of dependence vectors in $G$ whose corresponding dependence edges in $T$ are in the same dependence equivalent class.
In the example of Figure 1, vectors $e_1$ and $e_2$ in the multi-dimensional data flow graph are in the same vector equivalent class. For any node $V_{ij}$ in $T$, the edges in the $(1, 0)$ and $(1, 1)$ directions are in the same dependence equivalent class. Corresponding to the two vector equivalent classes, there are two data items, which are produced by statements (1) and (3) in the loop body.
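As a minimal sketch of how vector equivalent classes might be computed, the code below groups the nonzero dependence vectors of a data flow graph by the computation node that produces the data item. It assumes that each computation node produces exactly one data item per iteration, so that all loop-carried edges leaving the same producer carry the same item. The edge list, the statement names and the function `vector_equivalent_classes` are hypothetical.

```python
from collections import defaultdict


def vector_equivalent_classes(edges):
    """Group nonzero dependence vectors by the node that produces the item.

    `edges` holds (producer, consumer, delay_vector) triples of the
    multi-dimensional data flow graph G.
    """
    classes = defaultdict(set)
    for producer, _consumer, delay in edges:
        if any(delay):  # zero-delay edges are not loop-carried
            classes[producer].add(delay)
    return dict(classes)


# Hypothetical graph: S1 feeds two consumers with delays (1, 0) and (1, 1),
# S3 feeds one consumer with delay (0, 1), and S2 -> S3 has no delay.
edges = [
    ("S1", "S2", (1, 0)),
    ("S1", "S3", (1, 1)),
    ("S3", "S1", (0, 1)),
    ("S2", "S3", (0, 0)),
]
classes = vector_equivalent_classes(edges)

if __name__ == "__main__":
    for producer, vectors in classes.items():
        print(producer, sorted(vectors))
```

In this made-up graph the vectors (1, 0) and (1, 1) fall into one class because both transmit the item produced by S1, which mirrors the situation described for Figure 1.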
In this paper, " " and " " are used to indicate multiplication, and " " is the dot product of vectors. For a vector $\vec{d}$ in a 2-dimensional space, $d.x$ and $d.y$ represent its components in the x and y directions.
Execution Scheduling for the Iteration Space
As mentioned before, a single processor is used to compute all the nodes in $T$ sequentially. Determining the execution sequence for the computation nodes in the iteration space consists of choosing a linear scheduling and a linear partitioning. These two stages are called execution scheduling in this paper.
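The way a linear partitioning and a linear scheduling together fix one sequential order can be sketched as follows. This is an illustration under stated assumptions, not the paper's formal definitions: `normal` is assumed to be a vector normal to the partition boundaries, `width` the partition width measured along it, and `h` a vector normal to the scheduling hyperplanes; ties on a hyperplane are broken by node coordinates. All names are hypothetical.

```python
def execution_order(nodes, normal, width, h):
    """Order nodes partition by partition, then hyperplane by hyperplane."""

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1]

    def key(p):
        partition_index = dot(p, normal) // width  # which partition p is in
        hyperplane = dot(p, h)                     # position of its hyperplane
        return (partition_index, hyperplane, p)

    return sorted(nodes, key=key)


# A 4-by-2 iteration space cut into vertical strips of width 2,
# executed row by row inside each strip.
nodes = [(x, y) for y in range(2) for x in range(4)]
order = execution_order(nodes, normal=(1, 0), width=2, h=(0, 1))

if __name__ == "__main__":
    print(order)
```

For a schedule to be valid, every dependence vector must lead from an earlier node to a later one in this order; the sketch does not check that condition.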
Data Scheduling
Data scheduling answers the question: "When and where should the data be loaded and transferred while executing the application program?" In contrast to cache replacement strategies for general-purpose computers, data scheduling is done statically before execution. The on-chip memory size and the specific data scheduling scheme strongly influence the size of a partition. A block working set, defined below, is used when the discussion focuses on a single partition in the iteration space; in this paper, one partition is equivalent to a block working set. A data scheduling scheme is the scheme used during the compilation phase to determine when and where to load or move a data item within the iteration space in a memory hierarchy.
The minimal necessary on-chip memory size of a data scheduling scheme $A$ for a block working set $W$, denoted by $MC(W, A)$, is defined as the minimum on-chip memory size such that no extra on-chip memory miss happens during the execution of the block working set under the data scheduling scheme $A$.
(1) and (2) are satisfied.
1. It is being referenced or has been referenced.
2. It will be referenced later.
Notice that the instant at which a data item is produced is also treated as one reference to that data item. Conditions 1 and 2 guarantee that any data item occupies only one on-chip memory entry during its lifetime. The theorem below shows that the carrot-hole property is very valuable for on-chip memory utilization. This can be proven by using the definition of the carrot-hole property.
The Framework of Carrot-hole Data Scheduling
The carrot-hole data scheduling refers to the whole procedure of our scheme in order to minimize the number of on-chip memory misses. Partitioning and scheduling are two critical stages. After we obtain an appropriate partition and execution schedule, we can guarantee that the carrot-hole property is enforced by using simple rules of data scheduling for each data item, called Data Scheduling Rules. The first and second rules describe where to put a data item generated by a node, and the third one tells which data items should be removed from the on-chip memory.
Rule 1. After a data item x associated with a dependence equivalent class Q is generated, x is put into on-chip memory if there is an edge e in Q pointing to a node inside the current partition.
Rule 2. After a data item x associated with a dependence equivalent class Q is generated, x is put into off-chip memory if there is an edge e in Q pointing to a node outside the current partition.
Rule 3. A data item x associated with a dependence equivalent class Q can be removed from on-chip memory if no edge in Q will be referenced in the current partition. This is decided based on the longest delay vector with respect to the execution schedule inside the partition.
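The three rules can be sketched for a single data item as follows. This is a minimal illustration, assuming a hypothetical vertical-strip partition of width 4 executed row by row; the helpers `placement`, `release_step`, `inside` and `step_of` are not from the paper, and consumers that fall outside the iteration space are not treated specially.

```python
def consumers(node, q_vectors):
    """Nodes that reference the item produced at `node` through class Q."""
    return [(node[0] + dx, node[1] + dy) for dx, dy in q_vectors]


def placement(node, q_vectors, in_partition):
    """Rules 1 and 2: return (store_on_chip, store_off_chip) for the item."""
    targets = consumers(node, q_vectors)
    on_chip = any(in_partition(c) for c in targets)       # Rule 1
    off_chip = any(not in_partition(c) for c in targets)  # Rule 2
    return on_chip, off_chip


def release_step(node, q_vectors, in_partition, step_of):
    """Rule 3: step of the last in-partition reference, or None if there is none."""
    steps = [step_of(c) for c in consumers(node, q_vectors) if in_partition(c)]
    return max(steps) if steps else None


# Assumed partition: columns 0..3, executed row by row.
WIDTH = 4


def inside(p):
    return 0 <= p[0] < WIDTH


def step_of(p):
    return p[1] * WIDTH + p[0]


Q = [(1, 0), (1, 1)]  # one dependence equivalent class

if __name__ == "__main__":
    print(placement((1, 2), Q, inside))              # interior node
    print(placement((3, 2), Q, inside))              # node on the right boundary
    print(release_step((1, 2), Q, inside, step_of))  # step after which it may go
```

An item whose consumers lie on both sides of a boundary is written to both levels, which is why Rules 1 and 2 are evaluated independently rather than as alternatives.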
Partition and scheduling vectors as well as the partition size need to be properly selected such that the above data scheduling rules ensure the carrot-hole property. The execution schedule and the data schedule are tightly combined together in our scheme, as we will see in the following sections.
Minimal Necessary On-chip Memory Size
As mentioned before, valid $\vec{P}$, $\vec{s}$ and $\vec{h}$ are those that determine a feasible execution sequence for a realizable iteration space. All of them have to respect the precedences imposed by the data dependencies. It is well known that the following properties are true [10].
Partition for a Single Dependence Vector
The purpose of the partitioning scheme is to guarantee that on-chip memory misses only happen to the nodes close to the boundaries of a partition. For a given on-chip memory size, we try to enlarge the partition size and reduce the misses for each partition, while maintaining the carrot-hole property by our data scheduling scheme.
We begin with the special case $\vec{s} = (0, 1)$. Figure 2 shows a set of integral points that is of significant importance to the partition scheme; we call it a Period Region of a dependence vector $\vec{d}$, denoted by $PR(\vec{d})$. The results generated by the nodes inside this region are not consumed, and thus cannot be removed from the on-chip memory, until the next region begins to be executed. Therefore, the on-chip memory must be large enough to hold the data items computed in the whole period region. Since we are dealing with one dependence vector, the number of data items generated by the nodes in a period region is the same as the number of nodes in the region, denoted by $node_{total}$, and it dominates the necessary on-chip memory size. This number increases linearly with the partition width. The number of data items produced inside the period region and consumed outside the partition is denoted by $node_{out}$. Since these data items will not be referenced in the current partition, they should not be held in the on-chip memory. Before the nodes in the second period region begin to consume the data items generated in the first region, the data items acquired from the neighboring partition (nodes drawn as crosses in the figure), denoted by $node_{extra}$, have to be placed in the on-chip memory.
We use the notation $\vec{v}^{\perp}$ to represent a vector orthogonal to $\vec{v}$. It can be shown by geometric constructions that, for a given iteration space $T$ with dependence vector $\vec{d}$, scheduling vector $\vec{s}$ and partition width $b$, the number of nodes on a scheduling hyperplane bounded by a partition $\vec{P}$ can be expressed in terms of these quantities. Returning to the example in Figure 2(b), we find $node_{out} = 3$ and $node_{extra} = 1$.
Minimization of On-chip Memory Misses
After partitioning, the total number of on-chip memory misses for the whole iteration space is the summation of the basic on-chip memory misses for each partition, represented by:
$$\text{Total misses} = \text{Basic misses per partition} \times \text{Number of partitions} \qquad (2)$$
Here we assume that the iteration space is large enough that, after partitioning, any irregularly shaped partitions near the boundaries of the iteration space can be neglected. The following lemma provides the formulation to calculate the total number of on-chip memory misses.
Proof: Consider an arbitrary partition as shown in Figure 3(a). $P_1$ and $P_2$ are two partition boundaries and $ED$ is a line parallel to the x-axis representing an arbitrary row in the partition. We move the dependence vector $\vec{d} = (d.x, d.y)$ and let it point from the boundary of the partition to the point $F$ on $ED$. From the geometric properties, the integral points in $EF$ (point $F$ excluded) are those nodes in this row that need data from the neighboring partition. For a dependence vector $\vec{d}$ and partition vector $\vec{P} = (P.x, P.y)$, we get $|EF| = |d.x - d.y \ldots|$. We can see that the nodes in the shorter segment $EF$ must reference the data items already referenced by the nodes in the longer segment $AB$.
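The length $|EF|$ can be checked directly from the construction described in the proof. The derivation below is a sketch under three assumptions that are not stated explicitly above: the partition vector $\vec{P}$ gives the direction of the boundary $P_1$, the point $E$ lies on $P_1$, and $P.y \neq 0$. It is a geometric check, not a restatement of the lemma.

```latex
% Tail of the moved dependence vector lies on the boundary through E:
%   F - \vec{d} = E + t\,\vec{P} \quad \text{for some scalar } t.
\begin{align*}
  F.y - d.y &= E.y + t\,P.y, \quad F.y = E.y
      &&\Rightarrow\quad t = -\frac{d.y}{P.y},\\
  F.x - d.x &= E.x + t\,P.x
      &&\Rightarrow\quad F.x - E.x = d.x - d.y\,\frac{P.x}{P.y},\\
  |EF| &= \left|\, d.x - d.y\,\frac{P.x}{P.y} \,\right|.
\end{align*}
```

For vertical boundaries, $\vec{P} = (0, 1)$, this reduces to $|EF| = |d.x|$: only the nodes within $|d.x|$ columns of the boundary need data from the neighboring partition.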
According to the carrot-hole data scheduling rules, the data items required by the nodes on ...
This can be proven by contradiction, assuming that there exists some other schedule vector that is optimal. Similarly, we find that the total number of on-chip memory misses becomes a function of the partition vector $\vec{P}$. Then we can determine the optimal partition vector as follows.
Theorem 5.5
For a given iteration space $T$ with multi-dimensional data flow graph $G = (V, E, d)$, the optimal partition vector is orthogonal to one of the outermost dependence vectors.
Carrot-hole Data Scheduling Algorithm
For a given on-chip memory size constraint, the partition vector, the scheduling vector and the suitable partition width are determined statically by using the algorithm described below, so that the carrot-hole property will be enforced throughout the execution.
Carrot-hole data scheduling algorithm:
1. Model the problem as a multi-dimensional data flow graph G, and find the vector equivalent classes for the dependence vectors.
2. Find the two potential optimal scheduling vectors according to Theorem 5.4.
3. Find the two potential optimal partition vectors according to Theorem 5.5.
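Step 3 can be sketched as follows: pick the two outermost dependence vectors and return a vector orthogonal to each, as Theorem 5.5 suggests. The sketch assumes that "outermost" means the vectors with the smallest and largest polar angle, and that all dependence vectors lie within a half-plane with angles in [0, pi). The function names and the sample vectors are hypothetical, and the choice between the two candidates is not shown.

```python
import math


def outermost(vectors):
    """Return the dependence vectors with the smallest and largest polar angle."""
    by_angle = sorted(vectors, key=lambda d: math.atan2(d[1], d[0]))
    return by_angle[0], by_angle[-1]


def orthogonal(d):
    """Rotate a 2-D vector by 90 degrees counter-clockwise."""
    return (-d[1], d[0])


def candidate_partition_vectors(vectors):
    """One candidate partition vector per outermost dependence vector."""
    low, high = outermost(vectors)
    return orthogonal(low), orthogonal(high)


deps = [(1, 0), (1, 1), (0, 1)]  # assumed dependence vectors

if __name__ == "__main__":
    print(outermost(deps))
    print(candidate_partition_vectors(deps))
```

Comparing angles with `atan2` keeps the sketch short; an integer cross-product comparison would avoid floating point if the vectors can be nearly parallel.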
Experiments
In this section we present the application of our method to a well-known example taken from [6] . The loop body is:
In the example, the variable $a(i, j)$ is produced by statement $S_1(i, j)$ and consumed by statement $S_4(i, j + 1)$; hence, there is a uniform dependence from $S_1$ to $S_4$ of $(0, 1)$. Similarly, we find that all dependencies are uniform. The data dependence graph of the example is shown in Figure 4(a). The data dependencies are shown in Figure 4(b) and Figure 4(c).
The on-chip memory size $C$ and the iteration space width $N$ and length $M$ are also known as inputs. According to the algorithm, the schedule and partition vectors are determined first. By Theorem 5.4 and Theorem 5.5, the schedule vector is $(1, -6)$ and the partition vector is $(0, 1)$. Figure 6 shows the results for the carrot-hole and FIFO algorithms for various $C$, $M$ and $N$.
For an iteration space of size 500 × 1000 and an on-chip memory size of 500, Figure 5 shows the performance of the carrot-hole data scheduling method compared with the LRU and FIFO techniques for different block sizes. Our method gives much better results than both of the other approaches.
The optimal partition block size is 53. Even though the FIFO and LRU approaches can reach their best performance at a particular partition size, it is hard to find these positions. In carrot-hole data scheduling, the optimal partition size is calculated. The number of on-chip memory misses of the carrot-hole data scheduling method is 38350, while LRU and FIFO incur 511659 and 512279 misses respectively when the partition size is 53. For different available on-chip memory sizes, our method becomes more efficient as the available on-chip memory size becomes larger for the same problem. Table 1 lists the performance results for several examples taken from [6] and [13]. The iteration space sizes are all 1000 × 1000. When the problem size and on-chip memory size increase, this ratio could be even larger.
Table 1: Results of LRU, FIFO and Carrot-hole Scheduling.
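The size of the gap can be read off the miss counts quoted for partition size 53. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
# Miss counts reported for the 500 x 1000 iteration space at partition size 53.
carrot_hole = 38350
lru = 511659
fifo = 512279

lru_ratio = lru / carrot_hole
fifo_ratio = fifo / carrot_hole

if __name__ == "__main__":
    print(f"LRU  / carrot-hole: {lru_ratio:.2f}")
    print(f"FIFO / carrot-hole: {fifo_ratio:.2f}")
```

Both replacement policies incur roughly 13 times as many misses as carrot-hole data scheduling in this configuration.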
Conclusion
Most of the previous work in high-level synthesis has not considered the effects of the memory hierarchy on the performance of the final design. In this paper, a static data scheduling method, carrot-hole data scheduling, was presented to control the data traffic between different levels of memory. This scheme consists of a data scheduling phase and the selection of optimal partition and scheduling vectors. The approach minimizes the total number of on-chip memory misses by maintaining the carrot-hole property, which ensures that on-chip memory misses occur only at the nodes along the partition boundaries. Experiments show that this technique significantly reduces on-chip memory misses compared with traditional methods.
