This paper presents a novel optimization technique for the design of application-specific integrated circuits dedicated to performing iterative or recursive time-critical sections of multi-dimensional problems, such as image processing applications. These sections are modeled as cyclic multi-dimensional data flow graphs (MDFGs). This new optimization technique, called multi-dimensional interleaving, consists of a multi-dimensional expansion and compression of the iteration space, followed by a multi-dimensional retiming, while considering memory requirements. It guarantees that all functional elements of a circuit can be executed simultaneously, and that no additional memory queues proportional to the problem size are required. The algorithm runs optimally in $O(|E|)$ time, where $E$ is the set of edges of the MDFG representing the circuit. Our experiments show that the additional memory requirement is significantly less than that obtained by other methods.
Introduction
The design of Application Specific Integrated Circuits (ASICs) is usually required in order to improve the execution performance of computation-intensive applications. A large group of such applications consists of multi-dimensional problems, i.e., problems involving more than one dimension, such as computer vision, high-definition television, medical imaging, and remote sensing. An important characteristic of these problems is that they contain time-critical sections consisting of iterative or recursive execution of sets of operations, also known as loops. It is well known that a parallel implementation of such operations would improve the performance of the ASIC design.
Most of the previous research on loop parallelization has focused on one-dimensional problems [3, 4, 8, 11, 13, 15, 19, 27]. However, the performance improvement achievable by those methods is constrained by the number of delays (registers) in a cyclic data path. In this paper, we model the loops, or iterations, as multi-dimensional data flow graphs (MDFGs). A novel multi-dimensional transformation applicable to such MDFGs is able to obtain the desired high-level performance, while special consideration is given to the consequences of such transformations on the memory requirements. The multi-dimensional characteristic of the problems is the foundation for the high parallelism achievable, usually superior to results obtained through traditional methods based on one-dimensional techniques.
Recent studies have considered the optimization of nested loops, a software point of view of the multi-dimensional problems [1, 2, 6, 18, 29, 30]. In general, these methods transform the loops in such a way as to obtain a new sequence of execution characterized by a higher parallelism. This sequence of execution is commonly associated with a schedule vector, also called an ordering vector, that affects the order in which the iterations are performed. In the area of high-level synthesis, researchers have focused on the optimization of multi-dimensional problems through the selection of an appropriate schedule vector [16, 22, 24, 25]. The new schedule vector usually differs from the one used in the original design, introducing new memory requirements that may result in complex storage control and a substantial increase in the memory size.
In a previous study, it was shown that full parallelism can be obtained by the application of multi-dimensional retiming techniques [23]. However, that method required the use of additional memory queues proportional to the size of the problem. (The term queue is used throughout this paper for queues whose size depends on the problem.) In this study, we introduce an algorithm that optimizes the circuit design through the application of a multi-dimensional transformation technique that incorporates the multi-dimensional retiming, improving the parallelism by restructuring the iteration space and the loop body without changing the original schedule vector and, consequently, without requiring additional queues. The dimensions of the problem are associated with the axes of a Cartesian space, where the integral points represent the iterations of the loop. In a two-dimensional (2-D) problem, such axes are also known as rows and columns. If the system is originally scheduled to be computed row by row, then operations executed sequentially within one iteration, due to data dependencies, require an unfolding or retiming technique to be parallelized [5, 15, 20]. The retiming limitations are well known for one-dimensional problems. In the circuit design, data dependencies between neighbor points in the row direction are translated into one register in the [...] Leiserson and Saxe proposed a slow-down technique [15] in order to increase the number of delays in the circuit and to achieve the desired retiming, the so-called systolic design. However, this technique also reduces the circuit performance. In this paper, an expansion of the iteration space is applied to improve the potential for parallelism, while a compression avoids any loss of performance. The combination of the expansion and compression of the iteration space is the basis for our optimization technique, called multi-dimensional interleaving.
A maximum throughput of one result per cycle time is achievable after the multi-dimensional interleaving is applied. (The cycle time is assumed to be the longest execution time among all operations in the loop body.)
For simplicity, we use as an example a two-dimensional problem aimed at a single-processor system. It consists of a two-dimensional filter [9], represented by a transfer function $H(z_1, z_2)$ with $w = 1$ and $c(n_1, n_2) = 0.5$, $\forall n_1, n_2$, which can be translated into $y(n_1, n_2) = x(n_1, n_2) + \sum_{k_1=0}^{1} \sum_{k_2=0}^{1} 0.5\, y(n_1 - k_1, n_2 - k_2)$, for $(k_1, k_2) \neq (0, 0)$, or $y(n_1, n_2) = x(n_1, n_2) + 0.5\, (y(n_1 - 1, n_2) + y(n_1, n_2 - 1) + y(n_1 - 1, n_2 - 1))$. Figure 1(a) shows a segment of a program where four statements labeled A, B, C and D are computed inside a doubly nested loop. This code is equivalent to the simulation of the filter described above. A multi-dimensional data flow graph representing this problem is shown in figure 1(b). Nodes represent operations and edges represent data dependencies. The labels on the edges indicate the distance between iterations. Figure 1(c) shows a circuit design implementing the solution for this problem. In this example, we can observe the one-dimensional retiming constraint. For a row-wise execution, as established by the loop control variables, a register is placed between D and B to store data for the dependence $(1, 0)$. If we assume the problem size to be $M \times M$ points, a queue of size $M$ is used for the dependence $(0, 1)$, and one of size $M + 1$ is used for the dependence $(1, 1)$. It is easy to notice that B, C and D cannot be executed in parallel by applying one-dimensional retiming techniques, due to the number of delays in the lower cycle. Using multi-dimensional retiming techniques, it is possible to overcome this problem by selecting a new schedule vector such as $(1, 1)$. However, this solution would require three queues of variable size with a maximum length $M$.
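The row-wise recurrence of the filter can be sketched in a few lines of Python. This is only an illustration of the difference equation above, not the program of figure 1(a): the function name `simulate_filter`, the decomposition of the loop body, and the zero boundary condition for samples outside the image are assumptions made here.

```python
def simulate_filter(x):
    """Row-wise evaluation of
    y(n1, n2) = x(n1, n2) + 0.5 * (y(n1-1, n2) + y(n1, n2-1) + y(n1-1, n2-1)).

    x is a list of rows, so x[n2][n1] is the sample at position n1 of row n2.
    Samples outside the image are taken as 0 (an assumption; the source does
    not state the boundary conditions).
    """
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]

    def prev(n1, n2):
        # Previously computed output, or 0 outside the image.
        return y[n2][n1] if n1 >= 0 and n2 >= 0 else 0.0

    for n2 in range(rows):        # one row after another
        for n1 in range(cols):    # left to right within a row
            s = prev(n1 - 1, n2) + prev(n1, n2 - 1) + prev(n1 - 1, n2 - 1)
            y[n2][n1] = x[n2][n1] + 0.5 * s
    return y
```

The three calls to `prev` correspond to the dependences $(1, 0)$, $(0, 1)$ and $(1, 1)$; in hardware, the first needs a single register while the other two need storage that spans a full row.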
By applying our new technique, a combined expansion of the row-wise recursion and the compression of every three consecutive rows, we obtain the design shown in figure 2. In (a), we see the new MDFG with some additional edges. The new pairs of edges are translated into data paths containing multiplexors and demultiplexors, as seen in figure 2(b). The reason for such extra edges is easily explained by examining the original iteration space in figure 2(c), compared to figure 2(d), where the iteration space was expanded in the row direction and compressed in the column direction. The movement of iteration I2 to position $(4, 0)$ implies that the original dependence $(0, 1)$ from I1 to I2 is now $(4, 0)$, as is the edge between I2 and I3, while the dependence between I3 and I4 becomes $(-8, 1)$. Intuitively, we notice that if I2 had been moved to iteration $(1, 0)$, the retiming of the upper cycle in order to obtain full parallelism would not be possible. Proceeding with the optimization process, a multi-dimensional retiming applied to this new MDFG results in the desired fully parallel solution shown in figure 3. In this solution, we can observe that A, B, C and D can be executed in parallel. Multiplexors and demultiplexors are added to the design to obtain the fully parallel solution. The memory elements still have the same characteristics, i.e., only two queues, with all other elements having a fixed size not dependent on the problem. To evaluate the optimization achieved in this simple example, let us assume that this filter is applied to a two-dimensional image of $512 \times 512$ pixels. Considering the design implemented with standard CMOS cells, where the multiplier has an execution time of 40 ns and the adder 20 ns, the total computation time for the original design would be approximately 26.2 ms. After the optimization, the total computation time is reduced to approximately 10.4 ms, which implies a gain of 60% in the circuit performance.
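The timing estimate above can be checked with a short calculation. The sketch below assumes that one iteration of the original design chains one multiplication and three additions serially (an assumption chosen because it is consistent with the quoted 26.2 ms), and that the fully parallel design is limited by the slowest single operation. All names are illustrative.

```python
MULT_NS = 40          # multiplier execution time, from the text
ADD_NS = 20           # adder execution time, from the text
POINTS = 512 * 512    # image size, from the text

# Assumption: a serial iteration is 1 multiplication followed by 3 additions.
serial_ns = MULT_NS + 3 * ADD_NS
# After full parallelism, the cycle time is the longest single operation.
parallel_ns = max(MULT_NS, ADD_NS)


def total_ms(per_iteration_ns, points=POINTS):
    """Total computation time in milliseconds for the whole image."""
    return per_iteration_ns * points / 1e6


original_ms = total_ms(serial_ns)
optimized_ms = total_ms(parallel_ns)
gain = 1 - optimized_ms / original_ms   # fraction of time saved
```

Under these assumptions the two totals come out close to the approximate figures quoted above, and the gain is the ratio-based 60%.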
The total number of queues required by a fully parallel solution obtained by using previously developed multi-dimensional retiming techniques would be at least three, while, as seen in the example, our optimized design requires only two queues. In a more general filter, covering a window of size $(w + 1) \times (w + 1)$, the number of additional queues required by the multi-dimensional retiming methods would be $O(w^2)$, while in our proposed method it is zero.
The remainder of this paper presents the novel multi-dimensional interleaving technique, beginning with basic concepts and terminology in Section 2. In Section 3, we discuss the properties necessary to have an expansion of the iteration space in order to improve the potential for a fully parallel solution. Section 4 shows how to combine multiple rows of the iteration space to avoid any performance loss due to the expansion. Section 5 presents the main algorithm, showing how it affects the memory elements in the circuit. Examples of the application of our technique follow. A final conclusion summarizes the concepts introduced in this paper.
Background

Modeling Multi-Dimensional Problems
In this section we present some basic concepts and definitions in the interpretation of multi-dimensional problems, and how they relate to the circuit design. We begin by describing a multi-dimensional data flow graph.
A multi-dimensional data flow graph (MDFG) $G = (V, E, d, t)$ is a node-weighted and edge-weighted [...] Figure 4(b) shows a representation of the iteration space for the MDFG presented in figure 4(a). For simplicity, we will always show a small section of the iteration space with respect to our examples. Figure 4(c) is a magnification of the nodes in the iteration space, so that we can see the internal operations of each iteration. Notice that these figures of the iteration space will be used throughout the paper for illustration only.
We use the notation $u \xrightarrow{e} v$ to indicate that $e$ is an edge from node $u$ to node $v$. The notation $u \stackrel{p}{\leadsto} v$ means that $p$ is a path from $u$ to $v$. [...] To manipulate MDFG characteristics represented in vector notation, such as the delay vectors, we make use of component-wise vector operations. Considering the two-dimensional vectors $P$ and $Q$, represented by their coordinates $(P.x, P.y)$ and $(Q.x, Q.y)$, examples of arithmetic operations are $P + Q = (P.x + Q.x, P.y + Q.y)$ and $P \times Q = (P.x \times Q.x, P.y \times Q.y)$. The notation $P \cdot Q$ indicates the inner product between $P$ and $Q$, i.e., $P \cdot Q = P.x \times Q.x + P.y \times Q.y$. Vectors are ordered in a right-to-left lexicographic order, i.e., for two $n$-dimensional vectors $P = (P_1, P_2, P_3, \ldots, P_n)$ and $Q = (Q_1, Q_2, Q_3, \ldots, Q_n)$, $P < Q$ if, for some $1 \le i \le n$, $P_i < Q_i$ and $\forall j > i$, $P_j = Q_j$. For example, $(1, 0, 0) < (0, -2, 1) < (0, 1, 1)$. Vectors are also used to indicate the sequence of computation. In this paper, the sequence of execution is defined as a row-wise computation, i.e., iteration $\hat{i}$ is executed before iteration $\hat{j}$ if $\hat{i} < \hat{j}$. An illegal multi-dimensional retiming occurs if the resulting iteration space presents cycles. Previous studies have proposed different methods for finding a legal multi-dimensional retiming [21, 22, 23]. In this paper, by using multi-dimensional interleaving, we will show that the multi-dimensional retiming $(1, 0, \ldots, 0)$ suffices as the basic retiming function to achieve full parallelism.
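The vector operations and the right-to-left lexicographic order described above can be sketched as follows. The helper names (`add`, `cw_mul`, `inner`, `rl_less`) are illustrative and do not come from the paper.

```python
def add(p, q):
    """Component-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(p, q))


def cw_mul(p, q):
    """Component-wise product of two vectors."""
    return tuple(a * b for a, b in zip(p, q))


def inner(p, q):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(p, q))


def rl_less(p, q):
    """True if p < q in right-to-left lexicographic order.

    Python compares tuples left to right, so reversing both vectors makes
    the last coordinate the most significant one.
    """
    return tuple(reversed(p)) < tuple(reversed(q))
```

With a row-wise schedule, `rl_less(i, j)` tells whether iteration `i` is executed before iteration `j`: the row index (last coordinate in 2-D) is compared first, and the position within the row only breaks ties.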
The next section will show how the MDFG will be transformed to accommodate this retiming function without producing a cycle.
Expansion of the Iteration Space
Let us examine again the example in figure 4. We see that the data produced by A is immediately consumed by B. This condition implies a mandatory serial execution of these two operations. In order to improve the circuit performance, both operations should be executed in parallel. The simultaneous execution of those two functions requires a register or latch device between them. A well-known solution for this problem is to retime node A by one, i.e., to remove one register from each incoming edge (data path) of A and push it to the outgoing edges. We call this the traditional one-dimensional retiming. It is easy to verify that such a transformation does not solve the problem, [...] The expansion of the iteration space by the application of the expander function results in a new set of iterations larger than the existing number of points in the original problem. Therefore, some of the new iterations are not necessary, and we will consider them as if no computation were taking place at all. We say that these new iterations are empty or inactive, while the original ones are considered valid or active. For instance, if we apply the expander function with coefficient $f_e = (2, 1)$ to the example in figure 4, we obtain the MDFG presented in figure 6(a), with an equivalent iteration space shown in figure 6(b). In this new iteration space, $(0, 0)$ and $(0, 1)$ are examples of valid iterations, while $(1, 0)$ and $(1, 1)$ are empty. Intuitively, one can notice that the size of the memory elements in the corresponding design has doubled.
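The distinction between valid and empty iterations can be illustrated with a small sketch. It assumes, as the proof of lemma 4.1 states, that the expander function maps an iteration $P$ to the component-wise product of $f_e$ and $P$; the function names are hypothetical.

```python
def expand(point, f_e):
    """Image of an original iteration under the expander coefficient f_e."""
    return tuple(f * p for f, p in zip(f_e, point))


def is_valid(point, f_e):
    """True if a point of the expanded space is the image of an original
    iteration, i.e. every coordinate is a multiple of the matching
    coefficient. All other points are empty iterations."""
    return all(p % f == 0 for p, f in zip(point, f_e))
```

For $f_e = (2, 1)$, every second position in each row is empty, which matches the observation that the row-direction storage doubles.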
To compute the size of each of the memory elements associated with the dependence vectors, we begin by defining a vector that represents the number of points in each of the dimensions of the problem. We also define an auxiliary function called linearization.
Definition 3.2 Given an MDFG $G$ representing an $n$-dimensional problem, and a set of values $S_i$, $1 \le i \le n$, where $S_i$ represents the number of points in the $i$-th dimension of the problem, the size vector $S$ is given by $S = (S_1, S_2, S_3, \ldots, S_n)$.
Definition 3.3 Given an MDFG $G$ representing an $n$-dimensional problem, and its size vector $S$, the linearization function $L$, from $Z^n$ to $Z^n$, computes the vector $L(S) = (1, S_1, S_1 S_2, S_1 S_2 S_3, \ldots)$ [...]
Proof: According to the multi-dimensional retiming concepts, having all edges non-zero implies distributing delays $(1, 0, 0, \ldots, 0)$ among all edges in the graph. Therefore, the sum of the memory elements in any cycle $l$ must have a size greater than or equal to the length $k$ of the cycle, i.e., $d_e(l) \ge (k, 0, 0, \ldots, 0)$. From lemma 3.1, $M_{d_e} = d_e(l) \cdot L(S_e) = d_{e_1}(l) + d_{e_2}(l)\, S_{e_1} + \ldots$; therefore, if $d_{e_i}(l) > 0$ for some $i > 1$, or $d_{e_1}(l) \ge k$, there will be enough delays for the target retiming. □
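A minimal sketch of the linearization function and of the memory size used in the proof above, where the size is taken as the inner product of a delay vector with $L(S)$. The function names are illustrative, and the image size of 512 is borrowed from the introductory example.

```python
from itertools import accumulate
from operator import mul


def linearize(size):
    """L(S) = (1, S1, S1*S2, ...), with one entry per dimension."""
    return tuple(accumulate((1,) + tuple(size[:-1]), mul))


def memory_size(delay, size):
    """Number of storage elements for a delay vector: delay . L(S)."""
    return sum(d * w for d, w in zip(delay, linearize(size)))
```

For a problem of size $M \times M$ this gives one register for the dependence $(1, 0)$, a queue of size $M$ for $(0, 1)$ and one of size $M + 1$ for $(1, 1)$, in agreement with the introductory example. Doubling the first entry of the size vector, as the expander coefficient $(2, 1)$ does, doubles the storage for $(0, 1)$.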
However, the expander function has the side effect of lowering the efficiency of the circuit by introducing empty iterations. Therefore, the real fully parallel solution, i.e., all operations executed in parallel per valid iteration, has not been achieved, even though the latches required on each edge for a legal multi-dimensional retiming have been provided. In the next section, we show how to compensate for such a loss by compressing the iteration space into a new form, while keeping the same global schedule vector. [...] The iteration space will contain several target rows. One is particularly important for our method: the one that is used as the first target. Without loss of generality, we assume that such a row is row zero, and that the lowest column index is also zero. Figure 8 presents a migration operation applied to the expanded iteration space shown in figure 6. In this example, the original iteration $(0, 1)$ is moved to the new iteration $(1, 0)$; however, the result is not useful for our further optimization because some of the original dependencies in the vertical direction become dependencies in the row direction in the transformed space and restrict the application of the multi-dimensional retiming, i.e., the dependence $(0, 1)$ becomes $(1, 0)$ and $(-1, 2)$, and the new $(1, 0)$ is a retiming restriction on the upper cycle of the MDFG. Another possible problem would arise if there were a dependence $(-1, 1)$ that, after compression, would become $(-1, 0)$, creating a cycle in the iteration space and conflicting with the row-wise execution sequence.
[...] position of some iteration after the expansion and migration stages. Lemma 4.1 Given an MDFG $G = (V, E, d, t)$, submitted to an expander function with coefficient $f_e = (f, 1, 1, \ldots, 1)$ and a migration vector $c = (g, -1, 0, 0, \ldots, 0)$, an iteration $P = (p_1, p_2, \ldots, p_n)$ is expanded and moved to a target row in position $f_e \times P + \mathrm{mod}(p_2, f) \times c$.
Proof: After the expansion, a given iteration $P$ becomes $f_e \times P$. This new iteration is then moved to a target row $\alpha f$, $\alpha \ge 0$, integer, according to the direction of the migration vector $c$. Since the distance between the row that contains $P$ and the target row satisfies $0 \le p_2 - \alpha f < f$, we have $\alpha f \le p_2 < \alpha f + f$; therefore, $P$ must be moved down by $\mathrm{mod}(p_2, f)$ rows. Since the migration vector is applied to every coordinate of $P$, the new position of iteration $P$ is $f_e \times P + \mathrm{mod}(p_2, f) \times c$. □
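Lemma 4.1 translates directly into code. The sketch below uses a hypothetical function name, and the test values $f = 3$, $g = 4$ are those of the figure 9 example discussed in this section.

```python
def migrate(point, f, g):
    """Position of an iteration after expansion by (f, 1, ..., 1) and
    migration by c = (g, -1, 0, ..., 0), following lemma 4.1."""
    steps = point[1] % f                            # mod(p2, f): rows to move down
    expanded = (f * point[0],) + tuple(point[1:])   # f_e x P
    c = (g, -1) + (0,) * (len(point) - 2)           # migration vector
    return tuple(e + steps * ci for e, ci in zip(expanded, c))
```

Iterations already in a target row have `steps == 0` and are only expanded; every other iteration slides diagonally until it reaches the target row below it.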
Intuitively, we know that for a target row $h$, we have the heads of ERCs at positions $(\alpha f, h, x_3, \ldots, x_n)$, $\forall x_k$, $3 \le k \le n$, and $\alpha \ge 0$. After migrating the first iteration from row $h + 1$ to the first position after the head of an ERC at row $h$, we can determine the origin of the remaining $f - 2$ iterations through the next lemma. [...] Proof: From lemma 4.1, the original iteration $P$ is moved to $f_e \times P + \mathrm{mod}(1, f) \times c = (f p_1, h + 1, p_3, \ldots, p_n) + (g, -1, 0, 0, \ldots, 0) = (f p_1 + g, h, p_3, \ldots, p_n) = (\alpha f + 1, h, p_3, \ldots, p_n)$. From this relation, we can determine $g$ as $g = \alpha f + 1 - f p_1$ (1). Now, considering the point $Q = (q_1, q_2, q_3, \ldots, q_n)$ such that $q_2 = h + \beta$, for $0 < \beta < f$, from lemma 4.1, the migration operation maps the expanded iteration $f_e \times Q$ to $f_e \times Q + \mathrm{mod}(q_2, f) \times c$, i.e., $(f q_1, q_2, q_3, \ldots, q_n) + (\beta g, -\beta, 0, 0, \ldots, 0) = (f q_1 + \beta g, q_2 - \beta, q_3, \ldots, q_n) = (\alpha f + \beta, h, p_3, \ldots, p_n)$. Thus, $q_k = p_k$ for $k \ge 3$, and $\beta g = \alpha f + \beta - f q_1$ (2). Combining (1) and (2), $\alpha f + \beta - f q_1 = \beta (\alpha f + 1 - f p_1)$; therefore, $q_1 = (1 - \beta) \alpha + \beta p_1$. □
In the example of figure 8, since we want to bring down iteration $(0, 1)$ to position $(1, 0)$, the migration vector is $c = (1, -1)$. Figure 9(a) shows a different example, where a two-dimensional iteration space with the sole dependence $(0, 1)$ is expanded by $f_e = (3, 1)$ and has its rows migrated according to $c = (4, -1)$. Figure 9(b) shows the resulting iteration space after the migration operation. At this point, we ignore any changes in the coordinates of the second row of the resulting iteration space, which will be explained later in the column compression stage. In this example, the point $(1, 1)$ has been mapped to position $(7, 0)$, i.e., $(3, 1) \times (1, 1) + (4, -1) = (7, 0)$. From lemma 4.2, we can identify the third iteration $Q'$ in that same ERC ($\alpha = 2$, since the ERC begins at iteration $(6, 0)$) as expanded and migrated from $Q = ((1 - 2) \times 2 + 2 \times 1, 2) = (0, 2)$, which can be verified in the figure. When iterations are moved to new positions, their incoming dependencies change accordingly. The next lemma shows what happens to dependencies between the iterations in a target row $h$ and those that migrated to $h$. [...] It was already mentioned that two valid iterations cannot be expanded and migrated to the same position. We also know that the final iteration space cannot have dependencies contradicting the execution sequence, i.e., iterations depending on data to be produced in future iterations. For example, let us consider the dependence $d = (-2, 1)$ in figure 9(c). Using the same expansion and migration applied to figure 9(a), this dependence would become $d_m = (-2, 0)$, as seen in figure 9(d), which conflicts with the sequence of execution and also creates a cycle with the transformed dependence $(4, 0)$. This implies that we need to define a legal migration vector, which is done as follows:
Definition 4. [...] Choosing $g = 1$, by lemma 4.1, an iteration $P = (p_1, h + \beta, p_3, \ldots, p_n)$, with $0 < \beta < f$, is expanded and mapped to a target row $h$ at position $P' = f_e \times P + \beta \times c$. Since $0 < \beta < f$, and knowing that valid iterations $I = (i_1, h, i_3, \ldots, i_n)$ on row $h$ are located in positions $f_e \times I$ due to the expander function, $f_e \times P + \beta \times c$ must lie between two consecutive valid iterations originally from row $h$, i.e., between $(f i_1, h, i_3, \ldots, i_n)$ and $(f (i_1 + 1), h, i_3, \ldots, i_n)$ (1). We know that $f p_1 < f p_1 + \beta < f p_1 + f$, or $f p_1 < f p_1 + \beta < f (p_1 + 1)$. Therefore, $P'$ is located between two consecutive valid iterations where $p_1 = i_1$, which is the position of an empty iteration.
For part (b), we must make sure that for any incoming dependence $d(e) = (d_1, d_2, 0, \ldots, 0)$ to $P = (p_1, h + \beta, p_3, \ldots, p_n)$, with $0 < \beta < f$, originating in row $h$, the transformed dependence $d_m$ after expansion and migration is greater than or equal to $(1, 0, \ldots, 0)$, according to definition 4.4. From lemma 4.1, $P$ is mapped to $P' = f_e \times P + \beta \times c = (f p_1 + \beta g, h, p_3, \ldots, p_n)$. Replacing $g$, $P' = (f p_1 + \beta (\alpha f - f p + 1), h, p_3, \ldots, p_n) = (f (p_1 - \beta p + \beta \alpha) + \beta, h, p_3, \ldots, p_n)$. Therefore, $(f (p_1 - \beta p + \beta \alpha), h, p_3, \ldots, p_n) < P' < (f (p_1 - \beta p + \beta \alpha + 1), h, p_3, \ldots, p_n)$.
Using (1), we conclude that $P'$ is inside an ERC and, according to lemma 4.2, it does not conflict with any other valid iteration. □
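The effect of a migration vector on a dependence can be explored numerically. The sketch below is restricted to two dimensions and assumes that the transformed dependence is the difference between the new positions of the consumer and of the producer, with the producer in target row 0; the legality test is the condition quoted in the proof, namely that the transformed dependence is at least $(1, 0)$ in right-to-left lexicographic order. Function names are illustrative.

```python
def migrate_2d(p, f, g):
    """Lemma 4.1 in two dimensions: expansion by (f, 1), migration by (g, -1)."""
    steps = p[1] % f
    return (f * p[0] + steps * g, p[1] - steps)


def transformed_dependence(d, f, g, source=(0, 0)):
    """Dependence vector seen after expansion and migration, for a producer
    at `source` (assumed to lie in a target row) and a consumer at source + d."""
    target = (source[0] + d[0], source[1] + d[1])
    a, b = migrate_2d(source, f, g), migrate_2d(target, f, g)
    return (b[0] - a[0], b[1] - a[1])


def is_legal(d_m):
    """True if d_m >= (1, 0) in right-to-left lexicographic order."""
    return (d_m[1], d_m[0]) >= (0, 1)
```

With $f = 3$ and $g = 4$, the dependence $(0, 1)$ becomes $(4, 0)$, which is legal, while $(-2, 1)$ becomes $(-2, 0)$, which points backwards along the row and is rejected, as in figure 9(d).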
We notice that for an expander coefficient $f_e = (f, 1, 1, \ldots, 1)$, after migration, there will be $f - 1$ rows without any valid iteration in between the target rows. Since these rows are not useful for the desired computation, we remove them by a column compression, defined below:
Definition 4.5 Given an MDFG $G = (V, E, d, t)$, submitted to an expander coefficient $f_e = (f, 1, 1, \ldots, 1)$ and a migration vector $c$, producing $G'$, a column compression removes all non-target rows in the iteration space for $G'$. As a result of the column compression, an iteration $P = (p_1, h, p_3, \ldots, p_n)$ in a target row $h = \alpha f$, $\alpha \ge 0$, in $G'$, will have a new index $P' = (p_1, \alpha, p_3, \ldots, p_n)$.
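Combining lemma 4.1 with the column compression of definition 4.5 gives the complete index mapping of the interleaving. The following two-dimensional sketch (hypothetical names, figure 9 values $f = 3$, $g = 4$) also computes how the single original dependence $(0, 1)$ looks from each position of the transformed row.

```python
def interleave(p, f, g):
    """Expansion by (f, 1), migration by (g, -1) and column compression."""
    beta = p[1] % f      # rows moved down during migration
    alpha = p[1] // f    # index of the target row after compression
    return (f * p[0] + beta * g, alpha)


def new_dependence(source, d, f, g):
    """Dependence d leaving `source`, expressed in the interleaved space."""
    a = interleave(source, f, g)
    b = interleave((source[0] + d[0], source[1] + d[1]), f, g)
    return (b[0] - a[0], b[1] - a[1])
```

For producers originally in rows 0, 1 and 2, the dependence $(0, 1)$ becomes $(4, 0)$, $(4, 0)$ and $(-8, 1)$ respectively, so the transformed graph needs a mechanism to select the dependence vector according to the position inside each group of three iterations.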
Examining again the examples in figure 9, we notice that after migration and column compression, the target rows $0, 3, \ldots$ become $0, 1, \ldots$, and also that there is a non-regularity in the dependence vectors every 3 iterations in the row direction. In figure 9(b), we have the new dependence $(4, 0)$ for the first two iterations in the ERC (i.e., $(6, 0)$ and $(7, 0)$), and $(-8, 1)$ for the last iteration (i.e., $(8, 0)$). To model this non-regularity, we introduce the concept of an MDFG associated with an activation mechanism that provides the necessary information on when to use the dependence vectors. We call this new model a Multiplexed Multi-Dimensional Flow Graph (MMDFG). The two definitions below are used to characterize the MMDFG.
Definition 4. [...] From theorem 4.5, we notice that the new set of dependencies in the row direction, created by the transformation of vertical dependencies, may affect the final multi-dimensional retiming by introducing new restrictions on the number of delays in a cycle. In figure 10, the original dependence $(0, 1)$ produced the new dependence $(4, 0)$ in a cycle containing exactly 4 edges, which implies that the new dependence satisfies the desired retiming. However, this is not always true; therefore, in order to account for this new problem, we need to rewrite theorem 3.3 to accommodate the new model. [...] Considering these new concepts, in the next section we introduce the algorithm that produces the final parallel solution using, as input data, the MDFG describing the problem.
The Multi-Dimensional Interleaving Algorithm
As indicated by theorems 3.3 and 4.6, the correct choice of the expander coefficient and the migration vector is the key to obtaining the multi-dimensional retiming in the row direction such that all operations can be executed in parallel. We also know that a modified single-source shortest path algorithm, measuring the one-dimensional distance of all nodes in the graph to an arbitrary inserted node, is one approach to verifying the non-existence of negative cycles while computing the retiming function [7, 15, 26]. In our study, we reduce the problem of finding the expander coefficient, the migration vector and the retiming function to an algorithm based on a modified topological sort of the MDFG, combined with the formulation developed in the previous sections. In order to obtain such an algorithm, we define a multi-dimensional interleaving transformation.
Definition 5.1 A multi-dimensional interleaving transformation changes the structure of the iteration space represented by an MDFG $G$, producing an MMDFG $G_m$ which has its row-wise dimension expanded by a coefficient $f$ and its column-wise dimension compressed by $f$.
The combination of multi-dimensional interleaving and multi-dimensional retiming allows us to obtain fully parallel solutions for multi-dimensional problems without changing the initial global execution schedule. The algorithm applying such a combination is called MdIntRet. [...] In the first step, a modified topological sort is used to order the nodes in levels, stored in the vector LEV, in $O(|E|)$ time. Every node is assigned a unique negative level number equivalent to the result obtained by the single-source shortest path algorithm if each non-zero delay edge had infinite delays. Each node is labeled according to its level number, which produces a monotonically decreasing characteristic of the node indices in any path in the graph. The absolute value of the difference between any two nodes connected by multiple zero-delay paths $b_i$, where $i \ge 1$, represents the length of the longest zero-delay path between those two nodes. This characteristic produces the coefficients for the multi-dimensional retiming function. As mentioned before, the expander coefficient must be chosen large enough to allow a fully parallel solution after retiming the cycles with delays in the row direction. Then, the migration vector must be chosen such that no cycles are created in the row direction and the transformation of dependencies to the row direction does not interfere with the desired retiming function. In order to satisfy such conditions, the formulation used for computing the expander coefficient $f_e$ and the migration vector $c$ is obtained from the theorem below. [...] Combining with theorems 3.3 and 4.6, and the four cases above, we conclude that there is a multi-dimensional retiming producing a fully parallel solution. □
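The level assignment of the first step can be approximated by a topological sort restricted to zero-delay edges. The sketch below is not the MdIntRet algorithm itself, whose listing is not reproduced here; it only illustrates the stated property that level differences measure longest zero-delay paths. The function name, the edge representation, and the choice of placing source nodes at level 0 are assumptions.

```python
from collections import deque


def zero_delay_levels(nodes, edges):
    """Assign LEV[v] = -(length of the longest zero-delay path ending at v).

    edges is an iterable of (u, v, delay) triples, delay being a tuple.
    Edges with a non-zero delay are ignored, which has the same effect as
    giving them an infinite weight in a shortest-path formulation.
    """
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v, d in edges:
        if not any(d):                 # keep zero-delay edges only
            succ[u].append(v)
            indeg[v] += 1

    lev = {n: 0 for n in nodes}        # assumption: sources sit at level 0
    ready = deque(n for n in nodes if indeg[n] == 0)
    visited = 0
    while ready:                       # Kahn's topological sort
        u = ready.popleft()
        visited += 1
        for v in succ[u]:
            lev[v] = min(lev[v], lev[u] - 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(nodes):
        raise ValueError("zero-delay cycle found")
    return lev
```

Each node and each edge is handled a constant number of times, so the sketch runs in time linear in the size of the graph. Levels decrease along every zero-delay path, and the absolute difference between the levels of a source node and a node it reaches is the length of the longest zero-delay path between them.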
The next section illustrates the use of our algorithm in two multi-dimensional problems.
Examples
In this section we present the application of our method to multi-dimensional applications commonly found in the image processing area. The first example consists of a two-dimensional differential pulse-code modulation (DPCM) device, used in image data compression [12]. It can be represented by the equations: [...] computing the first iteration to be migrated, we find $p = 0$, i.e., the expanded iteration $(0, 1)$ is expected to move to the empty iteration $(\alpha f + 1, 0)$. The value of $\alpha$ is computed according to the edge $L \to K$ and results in 1, implying $g = 6$. Therefore, the migration vector is $c = (6, -1)$. The dependencies are then transformed as shown in table 1. If we assume the circuit design is implemented using CMOS standard cell technology, available in Mentor Graphics CAD tools [17], the multiplier will execute in 40 ns while the adder requires only 20 ns. Considering such execution times, the final fully parallel graph and the modified circuit design improve the execution time obtained in [10] from 60 ns to 40 ns. Assuming also that the number of computational points is $M \times M$, and comparing our results with those obtained by using the chained MD retiming proposed in [23], we notice that both are able to achieve the fully parallel solution; however, the memory requirement has been reduced from 16 queues in [23] to 12 queues in this paper. [...] when compared to other methods that require a new schedule vector. Therefore, it is clear that multi-dimensional interleaving and retiming is an efficient optimization tool to reduce the execution time of a multi-dimensional application, while requiring minor memory adjustments.
Conclusion
We have presented a novel technique for optimizing a circuit designed to compute multi-dimensional applications. The circuit was initially represented by a multi-dimensional data flow graph and was transformed through the use of a multi-dimensional interleaving method that we developed. This technique, combined with a multi-dimensional retiming, is able to achieve full parallelism in a single-processor design, i.e., simultaneous execution of all operations (nodes) of the multi-dimensional data flow graph, while maintaining the original schedule vector. The transformation applied to the circuit does not require additional queues dependent on the problem size, as usually occurs with other methods.
The complexity of this new algorithm is $O(|E|)$ time, where $E$ is the set of edges of the multi-dimensional data flow graph representing the problem. The description of the multi-dimensional interleaving technique and how to compute the changes in the memory structure were presented. Examples of image processing problems were used to demonstrate the significant improvements in execution time and memory requirements obtained by our method when compared to results from other known techniques.
