Massive uniform nested loops are broadly used in multi-dimensional DSP applications. Due to the large amount of data handled by such applications, optimizing data accesses by fully utilizing the local memory and minimizing communication overhead is important for improving overall system performance. Most traditional partition strategies do not consider the effect of data access on computational performance. In this study, a multi-level partitioning method, based on a static data scheduling technique known as carrot-hole data scheduling, is proposed to control the data traffic between different levels of memory. Based on this data schedule, the optimal partition vector, scheduling vector and partition size are chosen so as to minimize communication overhead. The partition scheme ultimately yields non-homogeneous partition sizes, which produce a significant performance improvement. Experiments show that this technique significantly reduces local memory misses compared with traditional methods. The method can be used in application-specific DSP system design and in compilers for DSP processors.
Introduction
Most digital signal processing applications are both computation intensive and data intensive. Because computation resources are limited, strategies have been studied that focus on how to partition the problem according to the number of available processors in order to increase parallelism and system throughput. However, data access overhead is another important issue affecting system performance. For example, filtering a two-dimensional image produced by the Thematic Mapper scanner of the LANDSAT satellite is equivalent to processing an image of 6000 × 6000 points. Each point, when computed, uses data generated by previously processed points. Therefore, the amount of data handled during the computation may introduce significant overhead in the total execution time. This paper investigates the partitioning problem for uniform nested loops and presents a multi-level partitioning and scheduling algorithm to reduce the data access and communication overhead of a multiprocessor system. Due to its characteristics, the technique proposed in this paper can be used either in compilers for DSP processors or in application-specific processor design for DSP applications.
Two basic partitioning methods, LSGP (Local Sequential and Global Parallel) and LPGS (Local Parallel and Global Sequential), are discussed in [17] and [19]. In both approaches, partitions are determined purely by the number of available processors. In [5], the buffer length needed by these two approaches is calculated, and it is concluded that the internal memory needed for LSGP is equivalent to the external communication of LPGS. Different data accesses in a hierarchical memory architecture and the communication cost between partitions are not considered. In [17], the I/O speed between the array processor and external mass data storage media is assumed to be balanced. A detailed partition scheme was proposed in [21], where the iteration space is partitioned into bands and the nodes in every band can be executed simultaneously. The size of the band is determined by the number of processors. This method can be classified into the LPGS category. The communication between bands is fulfilled by FIFO queues.
indices, however the solution will be equivalent to a local minimum. Partitioning, also known as tiling in the compiler research community, has been the target of extensive studies [3, 8, 15, 26, 27, 29, 30]. However, such studies do not consider possible local memory size constraints. A more in-depth study of this problem is presented in [6], which considers fixed size caches in the tiling process; however, it does not present an option to minimize the cache size in a computer design case. The objective of our technique is to allow the design of application-specific multi-processor systems with increased system throughput, by fully utilizing the processors and minimizing the communication between them. A homogeneous tightly-coupled multiprocessor system with shared and local memories, and dedicated communication links, is assumed to be the target architecture. The local memory will also be referred to as cache memory or just cache; however, it will not have any specific characteristic such as lines and set association, usually found in commercial systems. The number of processors and the local memory size are considered to be two important constraints. The number of partitions, the partition sizes and their shapes are determined by both of those constraints. The data scheduling method improves the performance of an individual processor and of the multi-processor system by applying the pipeline technique. The execution schedule for the partitions, the execution inside each partition and the processor assignment are carefully studied.
Let us examine the two-dimensional nested loop problem posed by Darte. The implementation of such a loop when K = 2 in a multi-processor system with a local memory able to store 1000 data values and utilizing the LRU approach for data replacement would require 1.7 million external memory accesses if the loop body was executed $10^6$ times. A slight change in the replacement strategy, so that it follows a FIFO criterion, would increase such memory accesses to over 1.75 million. The same problem, when submitted to the carrot-hole data scheduling proposed in this paper, has its external memory accesses reduced to 32 thousand. Such a reduction is equivalent to a gain of 55 times in memory access performance. To the author's knowledge, no other method is able to produce such a significant improvement. This paper is organized as follows. The next section presents the basic concepts and the formal problem description. The carrot-hole data scheduling scheme according to local memory constraints is introduced in Section 3. Section 4 discusses the optimal partitions by local memory consideration. Section 5 explores the partition sizes, considering both the local memory constraints and the number of available processors. Section 6 presents the results of some examples and precedes the conclusion of this study.
Background
The target problems considered in this paper belong to the class of DSP problems whose computations are identical over the entire index space. Due to their uniformity, such problems can be easily modeled by graphs. In the digital signal processing area, such graphs combine nodes representing tasks (operations) and edges containing delay operators in the z-transform notation [11]. In the design of wave digital filters, the graphs are slightly modified such that the delays are represented by storage elements with time interval components in each of the dimensions of the problem [13]. Studies in systolic array architectures use the concept of a reduced graph, where the nodes are the system variables and the edges represent dependencies by a vector label indicating the distance between iterations [3, 25, 38]. In this study, a more general graph model was adopted, where the nodes represent tasks or operations while the edges, also labeled by vectors, represent the dependencies. This model, which combines the advantages and generality of the previously mentioned representations, is called a multi-dimensional data flow graph and is used in this paper to represent the loop body [7, 23]. We focus on the uniform nested loops which can be represented in a 2-

The Iteration Space $T = (I^n, \mathcal{E}, w)$ with respect to the multi-dimensional data flow graph G is the replication of iteration instances represented by G in the index space $I^n$, where n indicates the number of dimensions. For a two-dimensional data flow graph G, n is 2. Each index node represents one execution of all the computation nodes in G. $\mathcal{E}$ is the set of edges corresponding to the nonzero delay edges in G (i.e., loop-carried dependencies). Edges representing inputs are regarded as coming from nodes outside the index space. The function $w(e)$, $e \in \mathcal{E}$, represents the data reference on the edge e. shows the iteration space T, and Figure 1(c) shows the corresponding nested loop.
Note that throughout the paper, "nodes" and "edges" refer to those in T unless otherwise indicated.
A data item is used to indicate the result generated by a certain computation within one iteration in the iteration space. If a set of outgoing edges of a certain node represents different references to the same data item, those edges are said to be in a dependence equivalent class. The equivalent class in the iteration space T can be generalized to define the equivalent class in a multi-dimensional data flow graph, which is called a vector equivalent class.
Definition 2.1 For a given iteration space $T = (I^2, \mathcal{E}, w)$ and its corresponding multi-dimensional data flow graph G = (V, E, d), a vector equivalent class is defined to be the set of dependence vectors in G whose corresponding dependence edges in T are in the same dependence equivalent class.
In the example shown in Figure 1, vectors $e_1$ and $e_2$ are in the same vector equivalent class. For any node $V_{ij}$ in T, the edges going in the (1, 0) and (1, 1) directions are in the same dependence equivalent class. Corresponding to the two vector equivalent classes, there are two data items, which are produced by statements (1) and (3) in the loop body.
Execution scheduling for the Iteration Space
If we consider a single processor being used to compute all the nodes in T sequentially, determining the execution sequence for the computation nodes in the iteration space consists of choosing a linear scheduling and a linear partitioning. These two stages are called execution scheduling in our paper.
A linear schedule is based on a set of parallel and uniformly spaced hyperplanes in T. The nodes on a hyperplane are executed sequentially. These hyperplanes are called scheduling hyperplanes. In the two-dimensional case, the direction of each scheduling hyperplane is denoted by a vector $\vec{h}$. The scheduling vector, denoted by $\vec{s}$, is the normal vector to the scheduling hyperplane and indicates the execution sequence of the scheduling hyperplanes in the iteration space. For a row-wise computation, $\vec{s}$ is (0, 1) and $\vec{h}$ is (1, 0).
Homogeneous rectangle or parallelogram partitions (see Figure 2) are considered in this paper. They are simple and make efficient use of the limited on-chip memory. We define a partition by the width of the partition, denoted by b, and a partition vector, denoted by $\vec{P}$. We assume that the regular partitions for the iteration space always intersect with the upper and lower bounds of the index space. Hence the partition width is the number of nodes in the horizontal direction for one partition in the iteration space. The input data for each partition come from neighbor partitions or the original input. The partition vector is the normal vector to the partition boundaries. An iteration space is realizable if there exist suitable $\vec{P}$, $\vec{s}$, $\vec{h}$ and b that yield an execution sequence.
Two kinds of partitions are discussed in this paper. One is the partition by local cache, which is determined by the size of the available local memory of each processor and is discussed in Section 4. The other is the partition by PE, in which the number of partitions is equal to the number of processors. The former is the traditional partition method in an LSGP approach. The multi-level partition scheme proposed in this paper decides the final partitions by examining these two kinds of partitions, and the final partition size does not need to be the same as either of them.
Inside each partition, a linear execution schedule for the processor working on this partition is based on a set of parallel and uniformly spaced hyperplanes in the iteration space, bounded by the partition boundaries. The nodes on one hyperplane are executed consecutively. These hyperplanes are called scheduling hyperplanes. In the two-dimensional case, the direction of each scheduling hyperplane is denoted by a vector $\vec{h}$. The scheduling vector, denoted by $\vec{s}$, is the normal vector to the scheduling hyperplane. It indicates the execution sequence of the scheduling hyperplanes inside a partition. For a schedule vector $\vec{s} = (s_x, s_y)$, we assume that the sequential execution along every scheduling hyperplane follows the direction $\vec{h}$ given by $\vec{h} = (s_y, -s_x)$, which is equivalent to a left-to-right sequence in most of our examples. For example, for a row-wise computation, $\vec{s}$ is (0, 1) and $\vec{h}$ is (1, 0).
We say that an iteration space is realizable if there exist suitable $\vec{P}$, $\vec{s}$, $\vec{h}$ and b such that an execution sequence exists for its nodes.
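The ordering implied by a schedule vector can be illustrated with a minimal Python sketch. It assumes the nodes of a partition are given as integer (x, y) pairs; the function names `direction_from_schedule` and `execution_order` are hypothetical and are not part of the method described here.

```python
from typing import List, Tuple

Vec = Tuple[int, int]


def dot(a: Vec, b: Vec) -> int:
    """Inner product of two 2-D integer vectors."""
    return a[0] * b[0] + a[1] * b[1]


def direction_from_schedule(s: Vec) -> Vec:
    """Direction h of the scheduling hyperplanes: h = (s_y, -s_x)."""
    sx, sy = s
    return (sy, -sx)


def execution_order(nodes: List[Vec], s: Vec) -> List[Vec]:
    """Order nodes hyperplane by hyperplane (increasing s . p),
    then along the direction h inside each hyperplane (increasing h . p)."""
    h = direction_from_schedule(s)
    return sorted(nodes, key=lambda p: (dot(s, p), dot(h, p)))


# Row-wise computation: s = (0, 1) gives h = (1, 0), i.e. left to right.
nodes = [(x, y) for y in range(2) for x in range(3)]
print(execution_order(nodes, (0, 1)))
```

For $\vec{s} = (0, 1)$ the sketch visits each row from left to right before moving to the next row, which matches the row-wise example in the text.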
Carrot-hole Data Scheduling
A new concept of data scheduling is introduced in this section. A processor working on a partition applies a data scheduling scheme, instead of a traditional cache replacement strategy, to determine when data are loaded into the local memory and when they are put into the shared memory. The scheme is designed to fully utilize the available local memory and to minimize the local memory misses, which cause data accesses to the shared memory or data transfers between processors, because local memory access is much faster than interprocessor communication and shared memory access.
Basic Concepts
The on-chip memory size and the specific data scheduling scheme are closely tied to the size of a partition. A block working set, denoted as W = (I . The block working set is used when the discussion focuses on a single partition in the iteration space. One partition is equivalent to a block working set in this paper. The following definitions are used to formalize the on-chip memory misses of a partition. I.e., the input edges of a block working set are those edges not coming from the nodes inside the same block working set.

Definition 3.3 The extra on-chip memory misses of a block working set W are the total on-chip memory misses during execution of W, except the basic on-chip memory misses of W.
In Figure 3, the edges drawn as dotted arrows represent the input edges of the current partition. The on-chip memory misses caused by those dotted edges are the basic on-chip memory misses of this working set. Any on-chip memory miss other than those basic misses is an extra on-chip memory miss of this working set.
We say that a data scheduling scheme is the scheme used during the compilation phase to determine when and where to load or move a data item within the iteration space in a memory hierarchy.
For a given partition of an iteration space and a particular data scheduling scheme A, the minimal necessary cache size, denoted by $MC_A$, is defined to be the minimum size of the cache such that no extra cache misses happen during the execution of the partition under the data scheduling scheme A.
The minimal necessary on-chip memory size is used to measure the performance of different data scheduling methods. In terms of on-chip memory utilization efficiency, one data scheduling scheme $A_1$ is better than a scheme $A_2$ if it requires a smaller minimal necessary on-chip memory size for a certain working set W, i.e., $MC_{A_1} < MC_{A_2}$. The basic idea of our partitioning method is to determine the size of the partition by local cache such that the minimal necessary cache size for this partition is the same as the available local memory of a PE. A data scheduling scheme is said to have the carrot-hole property if it satisfies the following definition.
Definition 3.4 Carrot-hole property: during the execution of an application, a data item is in the on-chip memory if and only if (1) and (2) are satisfied.
1. The data item is being referenced or has been referenced by the nodes inside the current block working set.
2. The data item will be referenced by the nodes inside the current block working set.
Notice that the instance in which the data item is produced is also treated as one reference to that data item. Conditions 1 and 2 guarantee that any data item occupies only one on-chip memory entry during its lifetime. The theorem below shows the link between the carrot-hole property and the minimal on-chip memory size.
Theorem 3.1 For a given block working set W and a fixed execution sequence (execution scheduling), a data scheduling scheme has the minimal necessary on-chip memory size if it has the carrot-hole property.
Proof: For a block working set W and the same execution sequence, by the definition of the carrot-hole property, the minimal necessary on-chip memory sizes are the same for all schemes which have the carrot-hole property.
Assume that there is a data scheduling scheme $S_1$ that has the carrot-hole property but whose minimal necessary on-chip memory size is not the smallest one. By the observation above, there must be a data scheduling scheme $S_2$ which does not have the carrot-hole property but whose minimal necessary on-chip memory size is even smaller. According to the definition of the carrot-hole property, $S_2$ must present at least one of the following three cases:
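As an illustration of Definition 3.4, the following Python sketch computes which data items are resident at each step of a reference trace and the resulting peak occupancy. It rests on assumptions that are not in the original text: the trace is given as a sequence of sets of item names (one set per executed node, with production counted as a reference), and an item is taken to be resident from its first reference through its last reference inclusive. The function names are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Set


def carrot_hole_residency(trace: Sequence[Set[Hashable]]) -> List[Set[Hashable]]:
    """For each step, return the items held on-chip under the assumed reading
    of the carrot-hole property: resident from first to last reference."""
    first: Dict[Hashable, int] = {}
    last: Dict[Hashable, int] = {}
    for t, refs in enumerate(trace):
        for item in refs:
            first.setdefault(item, t)  # first reference (includes production)
            last[item] = t             # overwritten until the last reference
    return [
        {item for item in first if first[item] <= t <= last[item]}
        for t in range(len(trace))
    ]


def peak_occupancy(trace: Sequence[Set[Hashable]]) -> int:
    """Largest number of items simultaneously resident over the trace."""
    return max((len(r) for r in carrot_hole_residency(trace)), default=0)


# Hypothetical trace: item "a" lives over steps 0-2, "b" over 1-3, "c" over 2-4.
trace = [{"a"}, {"b"}, {"a", "c"}, {"b"}, {"c"}]
print(peak_occupancy(trace))
```

Under these assumptions, the peak occupancy is the on-chip memory size needed to run the trace without evicting an item that is still going to be referenced.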
1. At least one data item is in the on-chip memory which will never be used again.
2. At least one data item is stored in the on-chip memory before its first reference.
3. At least one data item is taken out of the on-chip memory although it will be referenced again in the future.
If $S_2$ presents case 1, a scheduling scheme $S_3$ which follows the carrot-hole property can be applied, such that when that data item is last referenced, it is taken out of the on-chip memory. This memory entry can then be reused, and therefore $S_3$ uses no more on-chip memory than $S_2$, which contradicts the assumption that $S_2$ has a smaller minimal necessary on-chip memory size. If $S_2$ presents case 2, a scheduling scheme $S_3$ which follows the carrot-hole property can be applied, such that instead of putting the data item into the on-chip memory before it is referenced, the data item is put into the on-chip memory at its first reference. Before its first reference, the empty entry can be used by some other data item, which implies that the minimal necessary on-chip memory size of $S_3$ is no larger than that of $S_2$, which contradicts the assumption.
If $S_2$ presents case 3, the case is not valid, because the minimal necessary on-chip memory size does not permit extra on-chip memory misses. □
The Framework of The Carrot-hole Data Scheduling
The carrot-hole data scheduling refers to the whole procedure of our scheme for minimizing the number of on-chip memory misses. Partitioning and scheduling are two critical stages. After we obtain an appropriate partition and execution schedule, we can guarantee that the carrot-hole property is enforced by using simple rules of data scheduling for each data item, called Data Scheduling Rules. The first and second rules describe where to put a data item generated by a node, and the third one tells which data items should be removed from the on-chip memory.

Rule 1. After a data item x associated with a dependence equivalent class Q is generated, x is put into on-chip memory if there is an edge e in Q pointing to a node inside the current partition.
Rule 2. After a data item x associated with a dependence equivalent class Q is generated, x is put into off-chip memory if there is an edge e in Q pointing to a node outside the current partition.
Rule 3. A data item x associated with a dependence equivalent class Q can be removed from on-chip memory if no edge in Q will be referenced in the current partition. This is decided based on the longest delay vector with respect to the execution schedule inside the partition.
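The three rules can be sketched in Python as follows. This is only an illustration under stated assumptions: a dependence equivalent class Q is represented by the list of nodes its edges point to (`consumers`), a partition is a set of nodes, and Rule 3 is checked against a set of already executed nodes rather than through the longest delay vector mentioned above. The names `placement_on_generation` and `can_remove` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Node = Tuple[int, int]


def placement_on_generation(consumers: Iterable[Node], partition: Set[Node]) -> Set[str]:
    """Rules 1 and 2: decide where a freshly generated data item is stored.

    The item goes on-chip if some edge of its class points inside the current
    partition, and off-chip if some edge points outside it (possibly both)."""
    places: Set[str] = set()
    for target in consumers:
        places.add("on-chip" if target in partition else "off-chip")
    return places


def can_remove(consumers: Iterable[Node], partition: Set[Node], executed: Set[Node]) -> bool:
    """Rule 3 (assumed formulation): the item may leave on-chip memory once
    every consumer inside the current partition has been executed."""
    return all(target in executed for target in consumers if target in partition)


# Hypothetical 4 x 3 partition and a class with dependence vectors (1, 0) and (1, 1).
partition = {(x, y) for x in range(4) for y in range(3)}
inner = [(3, 0), (3, 1)]      # consumers of the item produced at node (2, 0)
boundary = [(4, 0), (4, 1)]   # consumers of the item produced at node (3, 0)
print(placement_on_generation(inner, partition))
print(placement_on_generation(boundary, partition))
print(can_remove(inner, partition, executed={(3, 0)}))
```

An item whose consumers lie on both sides of the partition boundary is reported as both on-chip and off-chip, which corresponds to storing it in both memories when it is generated.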
The example shown in Figure 3 helps to illustrate the rules on a specific block working set W after appropriate partitions and schedules are selected. Vectors $\vec{d}_1$ and $\vec{d}_2$ are dependence vectors. The dotted edges are the inputs for the original problem or the results generated by the neighbor partitions, i.e., they belong to Input(W). For the nodes in the inner part of the partition, we see two cases: nodes with edges crossing the partition boundary, and nodes without edges crossing the partition boundary. For example, for node b, according to Rule 1, the result is put into the on-chip memory. According to Rule 3, the data item associated with edge $e_3$ is removed from the on-chip memory, since $e_3$ is the longest edge among the edges in its corresponding equivalent class. For the nodes near the partition boundaries, such as node a, the data item of $e_1$ and $e_2$ must be removed using Rule 3, since $e_2$ is the last reference to its data item inside this partition. When a data item is used by some nodes inside the current partition as well as in the neighbor partition, the data item must be stored in both on-chip and off-chip memories at the time it is generated, according to Rules 1 and 2. For example, node a puts the produced data item in both on-chip and off-chip memory.
Partition and scheduling vectors, as well as the partition size, need to be properly selected such that the above data scheduling rules ensure the carrot-hole property. The execution schedule and the data schedule are tightly combined in our scheme. Two potential pairs of optimal schedule vectors and partition vectors can be determined for the iteration space, as explained in Sections 5.3 and 5.4. For each pair of selected partition and schedule vectors, the partition width such that no extra on-chip memory miss occurs can be selected, and the total number of on-chip memory misses computed. By choosing the pair which gives the smaller number of on-chip memory misses, and deciding the data schedule by analyzing the dependences according to the data scheduling rules, we can define the carrot-hole data schedule.
The carrot-hole data scheduling will be used later in our multi-level partitioning scheme. The minimal necessary cache size is calculated based on the carrot-hole scheduling. Then the partition by local cache and the execution schedule can be selected as discussed in the next section.
Partition with respect to Local Memory
As mentioned before, valid $\vec{P}$, $\vec{s}$ and $\vec{h}$ are those which can determine a feasible execution sequence for a realizable iteration space. All of them have to follow the precedences imposed by the data dependencies. It is well known that the following properties are true [19]. the on-chip memory size, it is meaningless to have scheduling hyperplanes parallel to the partition boundary. Therefore, it is not possible that the partition vector and scheduling vectors are parallel.
Partition for a Single Dependence Vector
The purpose of the partitioning scheme is to guarantee that on-chip memory misses only happen to the nodes close to the boundaries of a partition. For a given on-chip memory size, we try to enlarge the partition size and reduce the misses for each partition, while maintaining the carrot-hole property by our data scheduling scheme. The equivalent problem is to find the minimal necessary on-chip memory size for a given partition and schedule.
We begin with the special case $\vec{s} = (0, 1)$. In Figure 5, we have an example of an iteration space with $\vec{d} = (2, 3)$, where the nodes inside the rectangular frame do not depend on each other. This set of integral points is of significant importance to the partition scheme; we call it the period region, according to the definition below. The results generated by the nodes inside this region are not consumed, and thus cannot be eliminated from the on-chip memory, until the next region begins to be executed. Therefore, the on-chip memory size must be large enough to hold the data items computed in the whole period region. Since we are dealing with one dependence vector, the number of data items generated by the nodes in a period region is the same as the number of nodes in the region, denoted by $node_{total}$, and it dominates the necessary on-chip memory size. This number increases linearly with the partition width. The number of data items produced inside the period region and consumed outside the partition is denoted by $node_{out}$. Since these data items will not be referenced in the current partition, they should not be held in the on-chip memory according to our data scheduling rules; otherwise the carrot-hole property cannot be maintained. So, this number has to be deducted from the previous estimation of on-chip memory size. Before the nodes in the second period region begin to consume the data items generated in the first region, data items acquired from the neighbor partition (nodes drawn as crosses in the figure), denoted by $node_{extra}$, have to be located in the on-chip memory. From now on, we use the notation $\vec{v}^{\perp}$ to represent a vector orthogonal to $\vec{v}$. It is easy to verify that for a given iteration space T and scheduling vector $\vec{s} = (s_x, s_y)$,
(1) If $s_y \neq 0$, the distance along the y-axis between two adjacent scheduling hyperplanes is $\frac{1}{|s_y|}$.
(2) If $s_x \neq 0$, the distance along the x-axis between two adjacent scheduling hyperplanes is $\frac{1}{|s_x|}$.
In order to get the total number of nodes in a period region, we calculate the number of scheduling hyperplanes in a period region by $PR(\vec{d}) = \vec{d} \cdot \vec{s}$, while the number of points in a hyperplane is given by the lemma below.

if $P^{\perp}_x \neq 0$, and $\lfloor |b / s_y| \rfloor$ or $\lceil |b / s_y| \rceil$ if $P^{\perp}_x = 0$ and $s_y \neq 0$.
Proof: Consider $B_1$ and $B_2$, the two boundaries of the partition. Assume that $h_1$ and $h_2$ are two scheduling hyperplanes bounded by the dependence vector. Without loss of generality, as shown in Figure 6, the intersection of $B_2$ and $h_1$ is D. Let the intersection of $B_1$ and $h_1$ be (0, 0), and let l be the horizontal distance from D to (0, 0). By using the equations of straight lines for $B_2$ and $h_1$, the distance l can be computed by l =

The following theorem is used to determine the minimal necessary on-chip memory size $MC(\vec{d})$.

Proof: According to the carrot-hole data scheduling rules, the minimal necessary on-chip memory size is just large enough to hold the data items necessary in the current partition. So from Lemmas 4.2, 4.3 and 4.4, the proof is immediate. □
Considering that the partition width is usually much larger than the components of the dependence vector, the minimal necessary cache size is dominated by $node_{total}(\vec{d})$ when the schedule vector is not orthogonal to the dependence vector. In this case, the value is just the area of the period region $PR(\vec{d})$.
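For the special case $\vec{s} = (0, 1)$ with $\vec{d} = (2, 3)$, the size of the period region can be checked with a short Python sketch. It assumes a rectangular partition with vertical boundaries, so that every scheduling hyperplane (row) holds exactly b nodes; the helper names are hypothetical and the sketch covers only this special case.

```python
from typing import Tuple

Vec = Tuple[int, int]


def period_region_hyperplanes(d: Vec, s: Vec) -> int:
    """Number of scheduling hyperplanes in a period region: PR(d) = d . s."""
    return d[0] * s[0] + d[1] * s[1]


def node_total_rowwise(d: Vec, b: int) -> int:
    """Nodes in one period region for s = (0, 1), assuming b nodes per row."""
    return b * period_region_hyperplanes(d, (0, 1))


def independent_in_region(d: Vec, b: int) -> bool:
    """Check that no node of the period region depends on another node of
    the same region through the dependence vector d."""
    rows = period_region_hyperplanes(d, (0, 1))
    region = {(x, y) for x in range(b) for y in range(rows)}
    return all((x + d[0], y + d[1]) not in region for (x, y) in region)


d, b = (2, 3), 10
print(period_region_hyperplanes(d, (0, 1)))  # rows in one period region
print(node_total_rowwise(d, b))              # grows linearly with b
print(independent_in_region(d, b))
```

With these assumed values the period region spans three rows, so $node_{total}$ is 3b, which grows linearly with the partition width as stated in the text.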
Partition for Multiple Dependence Vectors
In this subsection we calculate the minimal necessary on-chip memory size for a partition with multiple dependence edges that belong to one vector equivalent class, and for multiple equivalent classes. The minimal necessary on-chip memory size of an equivalent class is decided by the maximum $node_{out}$ and $node_{extra}$ of the dependence vectors in the same equivalent class, as shown in the lemma below.

Lemma 4.6 Consider a given iteration space T with multi-dimensional data flow graph G = (V, E, d) such that, for all $e_i \in E$, $\vec{d}(e_i)$ is in the same vector equivalent class U. For a given partition vector and partition width, the minimal necessary on-chip memory size MC(U) is given by

$$MC(U) \le \max_{e \in E} \{ node_{total}(\vec{d}(e)) + node_{extra}(\vec{d}(e)) \}$$

Proof: According to the carrot-hole data scheduling rules, the data item has to be located inside the on-chip memory until its last reference in the partition. So by Theorem 4.5, the proof is immediate. □
The previous lemma gives the minimal necessary on-chip memory size for an iteration space T with dependence vectors belonging to one vector equivalent class. We can get the minimal necessary on-chip memory size for multiple vector equivalent classes from the theorem below.

Proof: Proven directly from Lemma 4.6 and the carrot-hole data scheduling rules. □
On-chip Memory Misses for the Whole Iteration Space
After partitioning, the total number of on-chip memory misses for the whole iteration space is the summation of the basic on-chip memory misses for each partition, which can be represented by:
Total misses = Basic misses per partition × Number of partitions    (1)

Here we assume that the iteration space is large enough that, after partitioning, any irregularly shaped partitions near the boundaries of the iteration space can be neglected. The following lemma provides the formulation to calculate the number of total on-chip memory misses.

Proof: Let us consider an arbitrary partition as shown in Figure 7(a). $P_1$ and $P_2$ are two partition boundaries and ED is a line parallel to the x-axis representing an arbitrary row in the partition. We move the dependence vector $\vec{d} = (d_x, d_y)$ and let it point from the boundary of the partition to the point F on ED. From the geometric properties, the integral points in EF (point F excluded) are those nodes in this row which need data from the neighbor partition. For a dependence vector $\vec{d}$ and partition vector $\vec{P} = (P_x, P_y)$, in the same way as in the proof of Lemma 4.1, we get |EF| = |d_x − d_y

Proof: According to the carrot-hole data scheduling rules, the data schedules for data items belonging to different vector equivalent classes are independent of each other. Due to this characteristic, it suffices to add up the basic on-chip memory misses for each class. □
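Equation (1) can be illustrated with a brute-force count in Python. The sketch makes several assumptions that go beyond the text: partitions are rectangular with vertical boundaries, there is a single dependence vector, only inputs coming from neighbor partitions are counted (one miss per crossing edge), and the width of the iteration space is a multiple of b. The function names are hypothetical.

```python
from typing import Tuple

Vec = Tuple[int, int]


def basic_misses_per_partition(d: Vec, b: int, rows: int) -> int:
    """Count nodes of a b-wide, rows-high rectangular partition whose source
    node along d lies in a neighbor partition (outside columns 0..b-1)."""
    dx, _dy = d
    misses = 0
    for _y in range(rows):
        for x in range(b):
            if not 0 <= x - dx < b:  # the producer is in another partition
                misses += 1
    return misses


def total_misses(d: Vec, b: int, rows: int, cols: int) -> int:
    """Equation (1): basic misses per partition times number of partitions."""
    num_partitions = cols // b
    return basic_misses_per_partition(d, b, rows) * num_partitions


# Hypothetical 100 x 6 iteration space, d = (2, 3), partition width b = 10.
print(basic_misses_per_partition((2, 3), 10, 6))
print(total_misses((2, 3), 10, 6, 100))
```

In this setting each row contributes $|d_x|$ misses, so widening the partition reduces the number of partitions and therefore the total, which is the effect the partitioning scheme exploits.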
In a di erent way of other studies about tiling mechanisms, the areas of the period regions of dependence vectors which represent the minimal necessary on-chip memory size for a partition are bounded by the partition boundaries and scheduling hyperplanes. Theorem 4.11 provides a simple approach to get two potential optimal scheduling vectors for a twodimensional problem. Theorem 4.11 For a given realizable iteration space T with multi-dimensional data ow graph G = (V; E; d), the optimal scheduling vector must be orthogonal to one of the two outmost dependence vectors.
Proof: From Property 1 in Section 4, we know that the valid scheduling hyperplanes have to be outside or overlap the outmost dependence vectors. Considers to be the optimal scheduling vector for a given partition vectorP . According to Lemma 4.8, the number of basic on-chip memory misses is only related toP . Whens changes, only the partition width changes if the on-chip memory size C remains constant. Assumes does not overlap any vector orthogonal to the outmost dependence vectors. Thens must be outside of the set of dependence vectors. Consider a dependence vectord. The range bounded by the scheduling hyperplanes H 1 , H 2 and partition boundaries P 1 and P 2 is the period region ofd re ecting the minimal necessary on-chip memory size required byd as shown in Figure 8 . In the gure, segment OG is perpendicular to P 1 and P 2 , and so does segment AD. AD goes through the ending point of the dependence vectord and intersects with H 2 at C. The area of PR(d) is jOGj jACj.
Suppose we have another scheduling vector s 0 such that (s 0 ) ? overlaps with the lowest dependence vector. H 3 and H 4 are the boundaries for the area corresponding to the minimal necessary on-chip memory size to s 0 . Because s ? is outside of the set of dependence vectors, it is obvious that jACj > jABj. SinceP = (P x ; P y ) and the block width is b, we have C = jOGj jACj = jb P x jPj j jACj If we substitute AC by AB, b can be enlarged. Therefore, according to Lemma 4.8, 4.9 and Theorem 4.10, b will be larger than the case withs. So the number of total on-chip memory misses is smaller withs 0 , which contradicts the assumption thats is optimal. 2
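The selection of the two candidate scheduling vectors in Theorem 4.11 can be sketched in Python. The sketch assumes that all dependence vectors have a non-negative y component, so that sorting by polar angle identifies the two outmost vectors, and it takes a schedule to be valid when $\vec{s} \cdot \vec{d} \ge 0$ for every dependence vector; this validity test is an assumption standing in for the properties cited from [19]. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[int, int]


def dot(a: Vec, b: Vec) -> int:
    """Inner product of two 2-D integer vectors."""
    return a[0] * b[0] + a[1] * b[1]


def outmost_vectors(deps: Sequence[Vec]) -> Tuple[Vec, Vec]:
    """Return the dependence vectors with the smallest and largest polar angle
    (assumes every vector has a non-negative y component)."""
    by_angle = sorted(deps, key=lambda d: math.atan2(d[1], d[0]))
    return by_angle[0], by_angle[-1]


def candidate_schedules(deps: Sequence[Vec]) -> List[Vec]:
    """Vectors orthogonal to an outmost dependence vector that do not point
    against any dependence (assumed validity test: s . d >= 0 for all d)."""
    candidates: List[Vec] = []
    for v in outmost_vectors(deps):
        for s in ((v[1], -v[0]), (-v[1], v[0])):  # the two orthogonal directions
            if all(dot(s, d) >= 0 for d in deps) and s not in candidates:
                candidates.append(s)
    return candidates


# Dependence vectors (1, 0) and (1, 1), as in the directions used for Figure 1.
print(candidate_schedules([(1, 0), (1, 1)]))
```

For the two dependence directions (1, 0) and (1, 1), the sketch returns one candidate orthogonal to each outmost vector; the text then compares the total on-chip memory misses of the two candidates to pick the better one.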
After the two potential scheduling vectors are determined, we nd that the total number of on-chip memory misses becomes a function of the partition vectorP . The correct choice of the partition boundaries has been extensively studied by the researchers working in tiling mechanisms 3, 16, 26] . In 15] and 30], techniques were developed for determining parti- tions that would allow a communication free solution. However, due to the con guration of the dependence vectors in more general cases, such as those used in our experiments, such techniques would result in solutions containing only one partition. In this study, we target the improvement of the execution time by reducing the communication overhead. The communication free foundations implicitly impose limitations to the solution that would not satisfy our requirements. Di erent of some concepts on tile mechanisms, we assume that communication will occur between partitions as soon as data is ready to be transmitted. From 6] we obtain the theoretical support to determine the optimal partition vector. We reformulate their solution in a simpli ed format according to the Theorem below. Theorem 4.12 For a given realizable iteration space T with multi-dimensional data ow graph G = (V; E; d), the optimal partition vector is orthogonal to one of the outmost dependence vectors.
Proof: From the discussion in Section 4.1, we know that the scheduling vector and the partition vector are not parallel to each other. Let $B_1$ and $B_2$ be the partition boundaries that are parallel to one of the outmost dependence edges. Assume that $B'_1$ and $B'_2$ are two alternative partition boundaries outside the outmost dependencies, as shown in Figure 9. The points $e$ and $a$ are the starting and ending points of an arbitrary dependence vector $d$. $H_1$ and $H_2$ are the optimal scheduling hyperplanes that bound the period region area of $d$.
Multi-level Partitioning
From the discussion in the previous section, we know that the local memory is one of the important factors affecting the performance of a uniprocessor system. A novel partitioning method, the multi-level partitioning method, which takes both processor and local memory constraints into consideration, is discussed in this section.
Framework of Multi-level Partition
The basic idea of a multi-level partition framework is to combine the partition by local memory with the partition by processors. Let $K_p$ be the number of processors and $K_c$ the number of partitions by local memory. The iteration space is statically partitioned by local memory into a set of parallel partitions $P_1, P_2, \ldots, P_{K_c}$. The partitions are ordered in the iteration space, and $P_i$ does not depend on the data items generated in $P_j$ if $i < j$. The processors (PEs) $u_1, u_2, \ldots, u_{K_p}$ are assigned to the partitions by the following rules.
1. At any time, one PE executes at most one node in the iteration space.
2. All the nodes in one partition are executed by the same processor as in the LSGP approach.
3. Partition $P_i$ is executed by processor $u_j$, where $j = i \bmod K_p$.
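The assignment rule in item 3 can be illustrated with a minimal sketch. The function name `assign_processors` is hypothetical, and the sketch assumes processor indices run over 0..K_p-1 (the value produced by the modulo), whereas the text numbers the processors starting from 1.

```python
def assign_processors(k_c, k_p):
    """Return {partition index i: processor index j} with j = i mod k_p.

    Partitions are numbered 1..k_c as in the text. Processor indices here
    run over 0..k_p-1, which is an assumption of this sketch.
    """
    if k_c < 1 or k_p < 1:
        raise ValueError("k_c and k_p must be positive")
    # Rule 2: every node of a partition runs on the single processor chosen here.
    return {i: i % k_p for i in range(1, k_c + 1)}


if __name__ == "__main__":
    for i, j in assign_processors(6, 3).items():
        print(f"P_{i} -> processor {j}")
```

With this mapping, any K_p consecutive partitions are assigned to K_p distinct processors, which is what allows them to be pipelined.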
In our method, the execution of these partitions can be done in parallel by applying the pipeline technique. A number of consecutive partitions that are executed by different processors and can overlap their execution time is called a cluster. The model of the communication and data transmission between the processors is formalized as: Inside any partition, the carrot-hole data scheduling rules are applied.
From the implementation characteristics, we assume that communication through external memory takes longer than direct communication between processors. The reason for this assumption is that the former always implies an access to external memory, to guarantee the cluster reutilization, while the latter may consist of a direct communication between processors. The sum of the numbers of communications of the two kinds is equal to the total misses described in the previous section. The number of communications of either kind between two partitions is equal to the basic misses per partition.
Equal-size Partition
The equal-size partition scheme is an easy way to combine partitions by PE and partitions by local memory to get the final partitions. Initially, all the partitions have the same partition width.
If $K_c < K_p$, the partitions by PE are chosen as the final partitions. If $K_c = K_p$, the two kinds of partitions are the same and there is only one cluster. If $K_c > K_p$, we partition the iteration space into $K_c$ partitions and $\lceil K_c / K_p \rceil$ clusters. Each cluster has $K_p$ partitions. Inside one cluster, pipelining is applied and the processors can execute simultaneously, while the clusters are executed sequentially. Each partition satisfies the local memory requirement of the assigned PE. After applying the pipeline technique to both the clusters and the partitions, all processors inside a cluster are kept busy almost all the time. In some cases, when $K_c > K_p$, $K_c$ may not be a multiple of $K_p$. We then choose the final partitions inside the cluster to be smaller than the original partitions by local memory, so that the number of final partitions is a multiple of $K_p$ and as close as possible to $K_c$. The earliest starting time of a partition is the earliest time at which the processor in charge of this partition can start working so that, throughout its execution, it does not need to stop and wait for data coming from the neighbor partition. The earliest starting time of partition $P_i$ can be established as:
1. If $t(n_p) + (\text{communication time}) > t(n_c)$, the starting time of the first node in partition $P_i$ is $(i - 1)\,(t(n_p) + (\text{communication time}) - t(n_c))$.
2. If $t(n_p) + (\text{communication time}) < t(n_c)$, the starting time of the first node is time 0.
Here, $n_c$ is the first node in partition $P_i$ that depends on data coming from the previous partition, and $n_p$ is the node generating the data needed by $n_c$. $t(n_c)$ and $t(n_p)$ are the numbers of time units measured from the moment the first node in their respective partitions began to execute. The time between the earliest starting time of the processor in charge of the first partition and the earliest starting time of the processor in charge of the last partition in a cluster is the time required before all processors can run in parallel. We call this time the pipeline fill-up time.
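The two cases above can be sketched as a single function. The names `earliest_start` and `t_comm` are hypothetical. `t_comm` stands for the communication time between neighboring partitions, and all times are assumed to be expressed in the same time units.

```python
def earliest_start(i, t_np, t_nc, t_comm):
    """Earliest starting time of the first node of partition P_i (i >= 1).

    t_np:   time at which n_p produces the data, relative to the start of
            its own partition
    t_nc:   time at which n_c consumes the data, relative to the start of
            its own partition
    t_comm: communication time between the two partitions (assumed name)
    """
    if i < 1:
        raise ValueError("partition index starts at 1")
    delay = t_np + t_comm - t_nc
    # Case 1: the data would arrive too late, so each partition is offset
    # from its predecessor by `delay`. Case 2: no offset is needed.
    return (i - 1) * delay if delay > 0 else 0


if __name__ == "__main__":
    print(earliest_start(3, t_np=4, t_nc=2, t_comm=1))
    print(earliest_start(3, t_np=1, t_nc=5, t_comm=1))
```

The first partition always starts at time 0, since the factor (i - 1) vanishes for i = 1.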
The correctness can be shown by the regularity of the iteration space. For partition $P_i$, the processor starts the execution at time $t = (i - 1)\,(\text{pipeline fill-up time})$, and for partition $P_{i-1}$ at $t = (i - 2)\,(\text{pipeline fill-up time})$. The difference in starting times is the pipeline fill-up time, which guarantees that, for node $n_c$ and the nodes executed before $n_c$, the processor does not need to stop and wait for the previous partition to send the data. If node $i_1$ in partition $P_{i-1}$ generates data for node $j_1$ in partition $P_i$, node $i_2$ generates data for node $j_2$, and both dependencies are represented by the same dependence vector in the vector equivalence class, then $t(i_2) - t(i_1) = t(j_2) - t(j_1)$. Hence $t(i_2) - t(j_2) = t(i_1) - t(j_1)$, so every such pair has the same timing offset, which is bounded by the offset between $n_p$ and $n_c$. The processor, therefore, does not need to stop and wait for the data from the neighbor partition. Notice that, due to the choice of the partition vector according to Theorem 4.12, no cycles exist involving different partitions. Several data items require one of the two kinds of communication among the nodes on one or several scheduling hyperplanes. A simple example, shown in Figure 10, illustrates the timing of data passing between processors. We notice that if the two communication times differ, there will be empty slots in the middle of the execution, because the processor has to wait for the data in order to keep the same pace for pipelining. This problem is solved by the varied-size partition method discussed in the next subsection.
Varied-size Partition
If communication through external memory takes strictly longer than direct communication between processors, a processor that is not executing a partition on the cluster boundary has to waste some time in order to keep the same pace as the processors that have to deal with external-memory communication. We formalize the execution time for the equal-size partition as follows. Consider an iteration space $T$ of size $M \times N$, where the execution time for the nodes in one scheduling hyperplane in one partition is $t_E$. If there are $K_p$ processors and $J$ clusters in the system, the partition width is $b$, and $J$ is large, we can ignore the pipeline fill-up time in each cluster, and the total time for completing the execution of the iteration space is $J N\,(t_E + \max\{\text{external-memory communication time},\ \text{direct communication time}\})$.
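As a rough illustration of the estimate above, the following sketch computes the number of clusters $\lceil K_c / K_p \rceil$ of the equal-size scheme and the resulting total time. The function names and the parameter names `t_ext` and `t_dir` (external-memory and direct communication times) are hypothetical labels introduced here, not notation from the text.

```python
def num_clusters(k_c, k_p):
    """Number of clusters, ceil(k_c / k_p), using integer arithmetic."""
    if k_c < 1 or k_p < 1:
        raise ValueError("k_c and k_p must be positive")
    return (k_c + k_p - 1) // k_p


def equal_size_total_time(j, n, t_e, t_ext, t_dir):
    """Total time J * N * (t_E + max(t_ext, t_dir)), fill-up time ignored.

    j:     number of clusters J
    n:     extent N of the iteration space
    t_e:   execution time t_E of the nodes on one scheduling hyperplane
           in one partition
    t_ext: communication time through external memory (assumed name)
    t_dir: direct inter-processor communication time (assumed name)
    """
    return j * n * (t_e + max(t_ext, t_dir))


if __name__ == "__main__":
    clusters = num_clusters(k_c=10, k_p=4)
    print(clusters, equal_size_total_time(clusters, 1000, 10, 5, 1))
```

Because every processor advances at the pace of the slower communication, the larger of the two communication times is charged on every scheduling hyperplane.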
The varied-size partitioning method eliminates the empty slots by shrinking the width of the boundary partitions, in order to reduce the execution time between two communications for the processor in charge of the cluster boundary partitions. The size of the cluster boundary partition is determined by $m b_2 + (\text{external-memory communication time}) = m b_1 + (\text{direct communication time})$,
where $m$ is the execution time of an individual node, $b_2$ is the width of the cluster boundary partition and $b_1$ is the width of the normal partition. When we pick a smaller width for the boundary partition, the minimal necessary local memory size (corresponding to the minimal necessary on-chip memory size in the uniprocessor partition) for the smaller partition is less than the local memory constraint. The available local memory of the processors executing the boundary partitions will therefore not be fully utilized. So the number of partitions in the varied-size partition scheme is very likely to be larger than the number of partitions in the equal-size partition scheme for the same problem. Thus, the number of clusters will be larger than that of the equal-size scheme and the number of communications will increase. However, the empty slots are eliminated. Note that the change in the partition widths does not change the partition vector, and therefore the increase in the communication rate is proportional to the increase in the number of clusters. Usually, the existing partitioning methods try to reduce the communication overhead by sacrificing processor time. This is appropriate when the execution time of one iteration is much smaller than the communication time, such as in a network of workstations. In this paper, we assume that the target systems are tightly coupled. Such architectures are commonly used in the design of application-specific integrated circuits and are becoming a trend in computer architecture [18]. Therefore, the communication channels are dedicated and may be used at full load, allowing the designer to optimize the overall performance by correctly balancing the use of the processors.
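The balance condition that fixes the boundary partition width can be solved for $b_2$ directly. The sketch below is a minimal illustration, assuming the hypothetical names `t_ext` and `t_dir` for the external-memory and direct communication times, and treating the widths as real numbers (rounding to a whole number of nodes is left out).

```python
def boundary_width(m, b1, t_ext, t_dir):
    """Solve m*b2 + t_ext = m*b1 + t_dir for the boundary width b2.

    m:     execution time of an individual node
    b1:    width of a normal partition
    t_ext: communication time through external memory (assumed name)
    t_dir: direct inter-processor communication time (assumed name)
    """
    if m <= 0:
        raise ValueError("node execution time must be positive")
    b2 = b1 - (t_ext - t_dir) / m
    if b2 <= 0:
        raise ValueError("communication gap too large for this width")
    return b2


if __name__ == "__main__":
    b2 = boundary_width(m=2, b1=10, t_ext=5, t_dir=1)
    # Both sides of the balance condition agree: 2*8 + 5 == 2*10 + 1.
    print(b2, 2 * b2 + 5, 2 * 10 + 1)
```

The boundary partition is narrower than the normal one exactly when external-memory communication is the slower of the two, which matches the assumption made in the text.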
Results
In this section we present simulation results from applying our method to practical cases introduced in [10, 24]. We compare the resulting performance of the equal-size partition method with the simple LSGP method (which partitions the problem only by the number of processors) and with a scheme that uses a FIFO or LRU cache replacement strategy in the execution of one processor. The iteration space size is assumed to be $1000 \times 1000$.
For a first set of experiments, we assume that the number of processors is the same as the number of partitions by local memory. Therefore, there is only one cluster, and only the direct communication between processors is considered. The experiments consist of simulations with uniform nested loop benchmarks from [10], wave digital filters (WDF) used to solve transmission line problems, and two-dimensional infinite impulse response (IIR) filters from [24]. Table 1 shows the results of these simulations. The first column defines the problem being solved; the second column specifies the local memory size assumed; the partition number column lists the number of partitions obtained by the multi-level partitioning technique, which is used for all the cache replacement strategies in the experiment; and the LRU, FIFO and carrot-hole (C-H) columns present the number of external memory accesses required when the LRU and FIFO replacement strategies and our carrot-hole data scheduling method are used. The last two columns present the ratios between the LRU and FIFO results and the carrot-hole results. We can see that for nested loop 1, the carrot-hole data scheduling improves on the LRU method by a factor of 15 with 16 partitions when the local memory size is 1000. In the best result, the gain over the FIFO approach is a factor of 743.

In another set of experiments, a multiprocessor environment is used as the target system, allowing the comparison of the equal-size and varied-size partitioning schemes. Table 2 summarizes the results of these tests for the example of nested loop 1 with an iteration space of size $10000 \times 10000$, using 500 as the local cache size. The time of direct communication between processors is assumed to be 1 unit for all data items, while communication through external memory takes 5 units. The time for a PE to execute a number of nodes in a row inside the partition by local memory is 10 units. The table presents the number of processors used, the number of clusters necessary, the number of communications of each kind, and the total execution time, in thousands of time units, including communication and execution. In these examples, the varied-size partition achieves an average reduction of 14% in the total time compared with the equal-size partitioning method.

Table 2: Equal-size partition and varied-size partition.
Conclusion
Most of the previous work on the partitioning problem did not consider the effects of the memory hierarchy and data allocation on the performance of the final design. However, in data- and computation-intensive DSP applications, memory access overhead plays an important role in the system performance. Data scheduling in the memory hierarchy cannot be separated from the partitioning problem. In this paper, a static data scheduling method, carrot-hole data scheduling, and a multi-level partitioning scheme were presented to control the data traffic inside the local memory of each processor and the data transfer between the processors. This approach lets the system fully utilize the local memory during the execution of each individual processor by maintaining the carrot-hole property, and minimizes the communication overhead by selecting optimal partitions. Equal-size and varied-size partition schemes were discussed. The varied-size partition eliminates the empty slots in the execution and makes pipelining perform better. Experiments show that, with this partition scheme, local memory misses and communication are significantly reduced compared with methods that lack a data scheduling technique, with improvements of up to hundreds of times. The results for the equal-size and varied-size partitions were also compared, showing an advantage for the varied-size partition method.
