Abstract-Time-critical sections of multidimensional applications, such as image processing and computational fluid dynamics, are in general iterative or recursive. Most of these applications require each iteration to be executed under a specific time constraint associated with the data input rate. The design of circuits dedicated to performing such repetitive tasks depends on optimization techniques to achieve the desired execution time. The retiming technique is one of these optimization tools; however, traditional retiming deals with only one dimension of the problem, and its achievable execution time has a lower bound imposed by characteristics of the initial design. This paper presents a novel optimization technique based on the application of multidimensional retiming. Multidimensional retiming improves the circuit performance by inserting into the circuit paths a fixed number of registers, independent of the size of the problem, and by restructuring the memory elements in a legal way. This technique guarantees that all functional elements can be executed simultaneously on circuits designed to solve problems involving more than one dimension. Experiments show that the additional elements required for the performance improvement have a small impact on the circuit area.
I. INTRODUCTION
COMPUTATION-INTENSIVE applications usually depend on time-critical sections consisting of loops of sequential operations, also called iterations. The design of application-specific circuitry for performing such critical sections is a commonly used way to improve the computing time. Optimization techniques are used to try to satisfy the time constraints found in the design specifications. An algorithm based on a multidimensional retiming transformation is presented in this study to guarantee the achievement of the execution time requirements by producing a circuit capable of executing all its operations in parallel. The multidimensional characteristics of the problem are the foundation for the high parallelism obtainable, usually superior to the results produced by traditional methods based on one-dimensional retiming, which ignore the advantages of multidimensionality.
Retiming was initially proposed by Leiserson and Saxe [11], focusing on one-dimensional problems. Retiming regroups the operations in each iteration to improve the parallelism. It is usually applied to a preliminary circuit design. This approach can restrict the optimization if the original multidimensionality of the problem is initially translated into a one-dimensional environment. As a consequence of this translation, the circuit optimization becomes restricted by constraints inherent to one-dimensional retiming of recursive problems, such as the limit on the performance improvement imposed by the number of delays existing in a cycle. Most of the research in this area has followed the one-dimensional approach and is, consequently, subject to the same constraints [13], [25], [26]. One-dimensional retiming is also widely used in software and high-level synthesis optimization tools [4], [19]. In this paper, an algorithm based on a multidimensional retiming technique [3], [21], [22] is used to model the placement of registers along the circuit data paths and to restructure the initial memory elements. The objective of this new algorithm is to improve the circuit performance by achieving full parallelism among all operations in the circuit. Because it uses multidimensional retiming concepts, it is not constrained by the number of existing registers in the circuit. This capability introduces a high flexibility in the transformations that can be applied to the circuit design.
Several studies on compiler optimization tools have aimed at the parallelization of uniform nested loops, a similar view to the circuit optimization problem addressed in this paper. Examples include unimodular transformations [2], [28], nonsingular matrix transformations [12], and loop skewing [27]. These techniques do not change the structure of the iterations and, therefore, are not able to achieve fine-grain parallelism. Other techniques that explore the multidimensionality of the loops are perfect pipelining [1], Doacross loops [6], and loop quantization [18]. These methods require the expansion (unrolling) of the loop in order to achieve the target parallelism, increasing the size of the problem.
Some research has been done in multidimensional retiming. In [3], multidimensional retiming is applied under the specific constraint of multidimensional delays with nonnegative components. The same constraint appears in [20], where a method is presented to transform the design of two-dimensional (2-D) filters such that a highly pipelined solution can be achieved. In [21], an ILP solution has been developed for the multidimensional retiming problem of a more general graph. All three cases can be regarded as special cases of our technique when applied to the circuit optimization problem.
An example of modification based on one-dimensional retiming can be found in the design of a 2-D filter in a study by Gnanasekaran [7]. The author assumes a row-wise execution of the 2-D problem and then implements the delay in the column direction as a fixed number of registers dependent on the problem size (row length). After this, a one-dimensional retiming and a data path redesign are applied to relocate the registers and produce a shorter cycle time. The resulting design has a nonoptimal cycle time equivalent to the sequential execution of one multiplication and one addition. This approach can be explained as a special case of our technique. In this paper we show that the execution time can be reduced to the time of the longest atomic operation. The same example, involving a similar view to 2-D retiming, appears in the studies by Wu [29], [30]. Again, the lack of emphasis on the retiming process limits the solution to the same cycle time obtained by Gnanasekaran, i.e., one adder and one multiplier. Wu uses blocking techniques, which require multiple processors, to improve the overall throughput of the circuit to one operation time. This problem is also discussed in [20], where a change in the sequence of the computations results in a circuit design with pipelined multipliers in order to improve the execution time. In our study, we present some of the theoretical aspects behind multidimensional retiming, as well as the consequences for the hardware design. A simple method of finding the multidimensional retiming function is presented in order to reduce the cycle time to the execution time of one operation (the longest one), using only one processing element with multiple functional units. A multidimensional data flow graph (MDFG) is used to represent the original problem and is optimized by our transformation before being translated back into the circuit design.
A. Some Examples
For simplicity, we use 2-D problems as examples. The two dimensions are generically referred to as x, also called the row direction, and y, or column direction. Let us examine a simple example representing a portion of a wave digital filter (WDF), shown in Fig. 1. In Fig. 1, $z_i^{-1}$ is used to indicate a delay element in the i direction, with i = 1 for the x-direction and i = 2 for the y-direction. We assume that the computation follows a row-wise sequence and that the total number of points in the x-direction is M and in the y-direction is N. Therefore, the 2-D delay (-1,1) is usually implemented
The delay (1,1) is translated into a serial placement of $z_1^{-1}$ and $z_2^{-1}$ elements. The current cycle time for this design is equivalent to the sequential execution of two additions and one multiplication. Handling the MDFG representation with our algorithm, we apply a 2-D retiming (2,0) to node D and (1,0) to node A, resulting in the MDFG shown in Fig. 2(a), with a cycle time equivalent to one multiplication. A new sketch for the circuit design is shown in Fig. 2(b). We notice that, for the row-wise sequence of execution, the chosen retiming function is equivalent to the placement of single buffers in the paths D → A, A → B, and A → C, while reducing the initial queue sizes. A similar example with different dependencies is shown in Fig. 3(a). In this case, the initial assumption on the computation sequence strongly affects how much the design can be optimized. To verify this, let us consider the same conditions as in our previous example, i.e., a row-wise computation. This assumption implies the design shown in Fig. 3(b), where the element $z_1^{-1}$ represents only one delay. The application of one-dimensional retiming cannot reduce the initial cycle time due to the constant number of delays in the cycle.
At this point, one can try to change the sequence of execution in order to obtain a larger queue in the path C → D.
However, a wrong choice may introduce a new constraint. For example, if one chooses to do the computation according to a column-wise sequence, the (0,1) delay will be equivalent to a single register, becoming the new constraint. Using our technique on the MDFG representing the circuit, we bypass such a constraint and produce an optimized design with a cycle time equivalent to the execution of one element. In this situation, the final graph and design after retiming the MDFG are shown in Fig. 4. Later, we will show that one of the new delay elements is equivalent to a single register, while the other two have a known maximum size. If we assume that all operations are unit-time, the multidimensional retiming speeds up the execution of one iteration by a factor of three. Actual time improvements are presented later, when we discuss the simulated implementation of such a circuit using CMOS technology and standard cell techniques through electronic CAD tools [14]-[16]. These simulations show small differences in the circuit layout area when compared to the magnitude of the performance improvement.
B. Memory Requirements
It is also important to keep the number of inserted registers as small as possible. When no systematic technique guides the optimization process, one can select a multidimensional retiming function and a sequence of execution that introduce large queues dependent on the problem size. Fig. 5(a) represents the iterations described by the MDFG in Fig. 3(a) and their inter-dependencies, distributed in a Cartesian space. A schedule vector s is used to represent the execution sequence of the iterations [9]. It is the normal vector of a set of parallel hyperplanes that define such a sequence of execution, as shown in Fig. 5(b). This sequence of execution is commonly known, in the design of multiprocessor systems, as wavefront processing [10]. In a single-processor environment, nodes in the same hyperplane are executed sequentially, according to a second level of scheduling associated with the hyperplane. Consider the example shown in Fig. 6(a). A row-wise execution is equivalent to traversing a hyperplane overlapping a row before advancing the execution to the next row (hyperplane). This execution sequence implies a schedule vector (0,1), normal to the row direction. For the valid retiming function r(A) = (-1,2) and r(D) = (-2,4), the resulting delays are those shown in Fig. 6(b). It is easy to notice that the initial row-wise execution sequence is no longer valid. However, the schedule vector s = (1,1) represents a valid sequence of execution. With this new schedule vector, the original problem consisting of one queue and three registers now requires 5 queues (one in each path).
To manage the selection criteria for the retiming function in an appropriate way, we explore some properties of the hyperplanes associated with the schedule vector, such that the new memory elements to be inserted in the circuit are not dependent on the problem size, i.e., their size does not depend on the number of points in any direction of the problem. As in one-dimensional retiming, input data is considered to be available from memory elements not considered in the solution of the problem, or from real-time systems that can be adapted to provide the data in the sequence in which they are required for execution.
These examples show how difficult it is to obtain the desired circuit optimization under the several constraints of the problem. The method described in the following sections allows the designer to make use of automatic tools to find the answer to the optimization problem. Section II shows basic concepts and terminology, such as mathematical models for dependence graphs and data flow graphs, and presents an introduction to multidimensional retiming. Section III explores some concepts in wavefront processing and how it affects the size of the queues used as memory elements in the circuit. Section IV shows the properties necessary to have a legal multidimensional retiming function and describes our circuit optimization algorithm. Section V shows some examples of application of our technique, followed by a summary of the concepts introduced in this paper.
II. BACKGROUND

A. Modeling the Problem
In this section we present some basic concepts and definitions in the interpretation of multidimensional data flow graphs, and how they relate to the circuit design. Some of the points presented include mathematical models, dependence vectors, iterations, and vector operations. We begin by describing a multidimensional data flow graph and how it relates to a circuit design.
A multidimensional data flow graph (MDFG) $G = (V, E, d, t)$ is a node-weighted and edge-weighted directed graph, where $V$ is the set of computation nodes, i.e., the functional elements in the circuit design; $E \subseteq V \times V$ is the set of dependence edges, equivalent to the circuit data paths; $d$ is a function from $E$ to $\mathbb{Z}^n$, representing the multidimensional delay between two nodes and implicitly indicating the storage elements required in the circuit design; and $t$ is a function from $V$ to the positive integers, representing the computation time of each node. A 2-D data flow graph
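formulation of any delay shown in a 2-D DFG. Looking at the MDFG shown in Fig. 3 We use the notation $u \xrightarrow{e} v$ to indicate that $e$ is an edge from node $u$ to node $v$. The notation $u \stackrel{p}{\leadsto} v$ means that $p$ is a path from $u$ to $v$. The delay vector of a path $p = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$ is $d(p) = \sum_{i=0}^{k-1} d(e_i)$, and the total computation time of $p$ is $t(p) = \sum_{i=0}^{k} t(v_i)$.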
To manipulate MDFG characteristics represented by vector notation, such as the delay vectors, we make use of component-wise vector operations. For example, two 2-D vectors P and Q, represented by their coordinates (P.x, P.y)
B. Retiming a Multidimensional Data Flow Graph
In this section, we show the basic ideas of multidimensional retiming, including cycle period, critical path, and characteristics of legal multidimensional retiming. We begin by describing the concepts related to the computational execution time.
The period during which all computational nodes in an iteration are executed, according to the existing data dependencies and without resource constraints, is called the cycle period. It is the maximum computational time among paths that have no delay. For example, the MDFG in Fig. 3(a)
The cycle period of an MDFG is equivalent to the cycle time of the circuit design that realizes the MDFG. In order to optimize the cycle period of an MDFG, we transform the graph using a multidimensional retiming function [3] , [21] .
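As an illustration of the definition above, the cycle period can be computed as the longest total computation time over paths made only of zero-delay edges. The following Python sketch is not taken from the paper: the function name `cycle_period`, the four unit-time nodes, and the edge list are hypothetical and chosen only to mimic a small 2-D example.

```python
from collections import defaultdict

def cycle_period(times, edges):
    """Longest computation time over zero-delay paths of an MDFG.

    times: dict mapping node -> computation time t(v)
    edges: list of (u, v, delay) with delay given as a tuple of integers
    """
    # Keep only zero-delay edges; they are the direct data paths.
    zero_edges = [(u, v) for u, v, d in edges if not any(d)]
    succ = defaultdict(list)
    indeg = {v: 0 for v in times}
    for u, v in zero_edges:
        succ[u].append(v)
        indeg[v] += 1

    # Longest-path computation in topological order.
    finish = dict(times)  # time of the longest zero-delay path ending at v
    ready = [v for v in times if indeg[v] == 0]
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in succ[u]:
            finish[v] = max(finish[v], finish[u] + times[v])
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if visited != len(times):
        raise ValueError("zero-delay cycle: the MDFG is not realizable")
    return max(finish.values())

# Hypothetical graph (an assumption, not a figure from the paper).
unit_times = {"A": 1, "B": 1, "C": 1, "D": 1}
example_edges = [
    ("D", "A", (0, 0)),
    ("A", "B", (0, 0)),
    ("A", "C", (0, 0)),
    ("B", "D", (-1, 1)),
    ("C", "D", (1, 1)),
]
period = cycle_period(unit_times, example_edges)
```

With these assumed values the longest zero-delay path is D, A, B (or D, A, C), so the sketch reports a cycle period of three time units.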
A multidimensional retiming r is a function from $V$ to $\mathbb{Z}^n$ that redistributes the nodes in the cell dependence graph showing the replication of
III. THE WAVEFRONT EXECUTION APPROACH
As we saw in the previous section, the schedule vector plays an important role in the multidimensional retiming process, since it guarantees the realizability of the resulting MDFG. In this section, we discuss the relationship between the schedule vector and the storage elements in the circuit design. We also explore the consequences of using orthogonal and nonorthogonal schedule vectors, i.e., how the memory elements in the circuit design are affected by the choice of the final schedule. The concepts covered in this section are important since they clarify how the retiming function can affect the distribution of memory elements in the circuit.
We already mentioned that, in a 2-D approach, the use of an orthogonal schedule vector implies a row-wise or column-wise execution. A row-wise execution is equivalent to a schedule vector (0,1), and it implies that delays (1,0) must be translated into single-register delay elements. For the same row-wise execution sequence, delays of the form (d.x, 0), where d.x > 0, produce queues of size d.x, which is independent of the number of points in the x- and y-directions (the problem size). However, if M is the number of points in the x-direction, delays of the form (0, d.y) are equivalent to queues of size d.y × M for d.y > 0. Fig. 9 shows the sequence of execution imposed by a row-wise computation in a 2-D problem. We notice the progress in the y-direction, as indicated by the schedule vector (0,1), and a faster recurrence in the x-direction. Intuitively, any data produced in a row to be used in the next row at the same column, i.e., dependence (0,1), requires a storage of size M. The size of this queue will be larger if the data is going to be used by a row that is at a distance larger than one from the current row; for example, dependence (0,2) implies a queue of size 2M. In the general case where the schedule vector may be nonorthogonal, this distance is measured as the distance between the hyperplanes determined by the schedule vector.
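The queue sizes quoted above for a row-wise execution can be checked with a short Python sketch. The function name `row_wise_queue_size` is hypothetical, and the single formula d.y × M + d.x is an assumption that combines the two cases stated in the text by measuring how far apart the producing and consuming iterations are in the linear row-by-row execution order.

```python
def row_wise_queue_size(delay, m):
    """Queue length for a 2-D delay (dx, dy) under row-wise execution.

    m is the number of points in the x-direction (the row length).
    Assumption: iteration (x, y) runs at position y * m + x of the
    execution order, so a delay (dx, dy) spans dy * m + dx positions.
    """
    dx, dy = delay
    size = dy * m + dx
    if size <= 0:
        raise ValueError("delay is not realizable under a row-wise schedule")
    return size

M = 1000  # illustrative row length
sizes = {d: row_wise_queue_size(d, M) for d in [(1, 0), (3, 0), (0, 1), (0, 2)]}
```

Under this assumption, (1,0) maps to a single register, (3,0) to a queue of size 3, and (0,1) and (0,2) to queues of size M and 2M, which matches the cases discussed in the text.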
The wavefront approach was initially proposed in 1974 by Lamport [10] in order to produce parallelism among iterations contained in nested loops found in computer programs. Later, this approach was extended to the design of array processors [9]. These applications focused on the use of multiprocessors to run the iterations, which is not the objective of this paper. However, using such concepts, we suggest a methodology for estimating the queue sizes: we generalize the computation of the distance between hyperplanes, Δh, for each dependence vector d as Δh = s · d, where s is the schedule vector. When Δh = 0, d indicates either a zero delay or a delay vector perpendicular to s. In the first case, d represents a dependence in the same iteration, which is equivalent to a direct data path in the circuit design. In the second case, d indicates a dependence between iterations in the same hyperplane, which in a 2-D problem is equivalent to a fixed-size queue. It is important for the designer to minimize the size of this queue. We say that the size of this queue is minimum if there is no integer vector f and integer constant λ > 1 such that d = λf. An example of such concepts can be obtained from a problem that has a nonorthogonal schedule vector. Fig. 10 shows an iteration space and the execution sequence for a schedule vector (1,1). We can see that delay vectors with value (-1,1) represent a dependence in the same hyperplane and, according to the concepts above, a dependence between two neighboring nodes in the hyperplane. Each of these dependencies is translated into one register at the circuit level. For this same example, a delay d = (-2,2) implies two registers, and d = (-t, t) is translated into a queue of t elements.
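A small Python sketch can make the hyperplane distance concrete. The function names are hypothetical. Using the greatest common divisor of the components to count registers inside a hyperplane is an assumption that generalizes the three cases given in the text, (-1,1), (-2,2), and (-t,t).

```python
from functools import reduce
from math import gcd

def hyperplane_distance(s, d):
    """Distance between hyperplanes, computed as the inner product s . d."""
    return sum(si * di for si, di in zip(s, d))

def same_hyperplane_registers(s, d):
    """Registers needed for a delay d lying inside a hyperplane of s.

    Assumption: d = lambda * f with f a shortest integer vector, and the
    queue holds lambda elements, i.e. the gcd of the components of d.
    A zero delay yields 0, meaning a direct data path.
    """
    if hyperplane_distance(s, d) != 0:
        raise ValueError("d crosses hyperplanes; its queue depends on the schedule")
    return reduce(gcd, (abs(c) for c in d), 0)

s = (1, 1)  # nonorthogonal schedule vector used in the text
registers = [same_hyperplane_registers(s, d) for d in [(-1, 1), (-2, 2), (-7, 7)]]
```

A delay is minimum in the sense used above exactly when this count equals one.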
implies that w immediately precedes U .
IV. THE OPTIMIZATION ALGORITHM
A. The Multidimensional Retiming Function
After identifying some important aspects of relating dependence vectors to memory elements according to some schedule vector, we must be sure to obtain a legal multidimensional retiming function. The technique of multidimensional retiming allows changes in more than one direction [3], [21] . Our method uses this property to eventually achieve full parallelism by incrementally applying the retiming function to multidimensional problems. In this section we introduce the properties, algorithm and supporting theorems for the multidimensional retiming function. A detailed discussion of multidimensional retiming is found in [21] .
The capability of achieving full parallelism allows the designer to obtain the lowest possible cycle time. Without loss of generality, we assume all elements in the circuit to have execution time equal to one. We say that a multidimensional retiming for a realizable MDFG G = (V, E, d, t) is legal if it transforms G into a new MDFG $G_r$ such that $G_r$ is still realizable. A general set of constraints for a legal multidimensional retiming that guarantees full parallelism among the nodes of an MDFG G = (V, E, d, t) Intuitively, we know that the strictly positive scheduling subspace S+ is not empty whenever the MDFG is realizable.
Using such a concept, we introduce the method of computing a legal multidimensional retiming. Initially, we define the basic conditions for a legal multidimensional retiming through the following property.
Property 4.1: A legal multidimensional retiming r on an MDFG G = (V, E, d, t), such that s is a schedule vector for the retimed graph $G_r$, has the following characteristics:
for any path $u \stackrel{p}{\leadsto} v$,
Through the following theorem, we get the theoretical foundations to support the selection of a legal multidimensional retiming function. 
□
We conclude that, given a set of dependence vectors, after defining its strictly positive scheduling subspace, it is possible to predict a legal multidimensional retiming function. The application of such a retiming function transforms the MDFG into a graph with a new set of dependence vectors. We can now apply this multidimensional retiming to the optimization of circuit designs by constructing an MDFG representing the multidimensional problem, optimizing the MDFG, and redesigning the circuit through a translation process where the retimed graph nodes are the circuit elements and the delays are translated into queues (FIFO elements), according to the chosen schedule vector. To keep the retimed dependence vectors equivalent to minimum-size queues, we choose the retiming function in such a way as to produce minimum delays, as seen in Theorem 3.1. We define a minimum multidimensional retiming function as follows:
Definition 4.2: A minimum multidimensional retiming r is a legal multidimensional retiming such that there is no integral vector f and integer constant λ > 1 such that r = λf.
Here we introduce two important corollaries of Theorem 4.1. Corollary 4.2 gives us the alternative of grouping nodes when doing the multidimensional retiming, while Corollary 4.3 explores the capability of using multiple values of the retiming function for a set of nodes when necessary.
Considering the concepts seen above, we propose the following steps to find a legal multidimensional retiming: we begin by selecting a schedule vector s that satisfies the inequalities s · d(e) > 0 for every e ∈ E with d(e) ≠ (0, 0, ..., 0). We then choose a minimum retiming function from the hyperplane with s as the normal vector. The selected retiming function, when applied to any node that has all incoming edges with nonzero delays, is a legal multidimensional retiming according to Theorem 4.1, and can be used to produce the desired fully parallel solution.
legal retiming in G. □
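The two selection steps above can be sketched in Python for the 2-D case. The function names are hypothetical. The delay vectors (-1,1) and (1,1) are those of the Fig. 1 example, and s = (0,2) is used here as an illustrative schedule vector (the sum of those two delays). The sign of the perpendicular vector is an arbitrary choice of this sketch, since the negated vector is equally perpendicular to s.

```python
from math import gcd

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def is_strictly_positive_schedule(s, delays):
    """True if s . d > 0 for every nonzero delay vector d."""
    return all(dot(s, d) > 0 for d in delays if any(d))

def minimum_perpendicular(s):
    """Shortest integer vector perpendicular to a 2-D schedule vector s.

    Assumption: only the 2-D case is handled, and the returned sign is
    arbitrary; the negated vector lies in the same hyperplane.
    """
    sx, sy = s
    g = gcd(abs(sx), abs(sy))
    if g == 0:
        raise ValueError("the schedule vector must be nonzero")
    return (sy // g, -sx // g)

delays = [(0, 0), (-1, 1), (1, 1)]  # zero delays are ignored by the test
s = (0, 2)
valid = is_strictly_positive_schedule(s, delays)
r = minimum_perpendicular(s)
```

For these values the schedule test succeeds and the sketch returns r = (1,0). A row-wise schedule s = (0,1) would fail the same test on a graph containing a delay (1,0), because the inner product is then zero rather than strictly positive.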
B. The Multidimensional Optimization Algorithm
We now describe a method for optimizing a circuit designed to solve multidimensional problems. We call it the MDR algorithm, and it consists of the following steps:
1. Translate the original problem or the existing circuit design into its equivalent MDFG G.
2. Choose a schedule vector s from the strictly positive scheduling subspace of G.
3. Choose a minimum multidimensional retiming r perpendicular to s.
4. Remove from G all edges with nonzero delays. Since there is no zero-delay cycle in a realizable MDFG, the result is a directed acyclic graph (DAG).
5. Perform a modified topological sort algorithm [24] to order the nodes in levels. Each node is labeled according to its level number, which produces monotonically increasing node indices along any path in the graph. We call the maximum length k the highest level number obtained in the construction process described above.
6. Retime each node $v_i$ by (k - i) × r, where i represents the level to which the node belongs. This is a legal retiming according to Corollary 4.3. The result is a fully parallel MDFG $G_r$, which implies a cycle time equal to the longest node execution time in t.
7. Translate the retimed MDFG $G_r$ back into a circuit design.
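The edge removal, leveling, and retiming steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mdr_retiming` and the edge list are hypothetical (the paper's figure is not reproduced here), levels are taken as the longest zero-delay distance from a source node, and the retimed delay of an edge from u to v is computed as d(e) + r(u) - r(v).

```python
from collections import defaultdict, deque

def mdr_retiming(nodes, edges, r):
    """Level the zero-delay DAG and retime each node by (k - level) * r."""
    succ = defaultdict(list)
    indeg = {v: 0 for v in nodes}
    for u, v, d in edges:
        if not any(d):  # keep zero-delay edges only
            succ[u].append(v)
            indeg[v] += 1

    # Modified topological sort: level = longest zero-delay path from a source.
    level = {v: 0 for v in nodes}
    queue = deque(v for v in nodes if indeg[v] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if seen != len(nodes):
        raise ValueError("zero-delay cycle: the MDFG is not realizable")

    k = max(level.values())  # maximum length
    retiming = {v: tuple((k - level[v]) * c for c in r) for v in nodes}
    retimed = [
        (u, v, tuple(dc + ru - rv for dc, ru, rv in zip(d, retiming[u], retiming[v])))
        for u, v, d in edges
    ]
    return level, retiming, retimed

# Hypothetical edge list loosely modeled on the wave digital filter example.
nodes = ["A", "B", "C", "D"]
edges = [
    ("D", "A", (0, 0)),
    ("A", "B", (0, 0)),
    ("A", "C", (0, 0)),
    ("B", "D", (-1, 1)),
    ("C", "D", (1, 1)),
]
level, retiming, retimed = mdr_retiming(nodes, edges, r=(1, 0))
```

With r = (1,0) the sketch assigns level 0 to D, level 1 to A, and level 2 to B and C, and retimes D by (2,0) and A by (1,0), which agrees with the retiming applied to nodes D and A in the wave digital filter example. Every retimed edge then carries a nonzero delay.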
GET(u, Queue);
PUT(PRED(u), Queue)
k++
/* final retiming function (step 6) */
∀ u ∈ G
Let us examine how the algorithm works. In the first step, either the original problem or an existing circuit design is translated into its equivalent MDFG G, i.e., operations become nodes, while data dependencies or circuit paths are transformed into edges of the graph. Since there are no specific requirements for the schedule vector s, it can be arbitrarily chosen from the strictly positive scheduling subspace for G. Let us revisit the example in Fig. 1. One possible way to select s is to find the outermost delay vectors in the set of nonzero delay vectors used to represent the inter-iteration dependencies. Adding up these two vectors, and verifying that the resulting vector is not perpendicular to one of the outermost vectors, implies that the resulting vector belongs to the strictly positive scheduling subspace. In the example, since there are no other vectors, the two outermost ones are (-1,1) and (1,1). Therefore, s = (-1,1) + (1,1) = (0,2).
After choosing s, we must choose the minimum multidimensional retiming r perpendicular to s. In the example, r = (1,0). In the next step, we must find the coefficients for the multidimensional retiming of each node, such that all edges will have nonzero delays. This implies the successive application of the multidimensional retiming to several nodes in a zero-delay path. Since r is the only retiming vector found by our algorithm at this point, using Corollary 4.3, we must apply multiples of r to each node. Therefore, it is important to assign to a node u a coefficient greater than the coefficient applied to a node v if there is a zero-delay path from u to v. To find the values of such coefficients, a topological sort is done, assigning each node to a level that indicates its distance from the beginning of any zero-delay path. Edges (-1,1) and (1,1) are removed from the graph in order to execute the topological sort, which assigns level 0 to node D, level 1 to node A, and level 2 to nodes B and C. The maximum length is k = 2. Using these values, the retiming function of each node is then computed, i.e., r(D) = (2,0), r(A) = (1,0), and r(B) = r(C) = (0,0). This is equivalent to the fully parallel solution shown in Fig. 2. As a consequence of the algorithm just presented, we state a full parallelism theorem as follows:
Proof: By using the multidimensional retiming algorithm MDR, we transform C to G. In the next step, we find r, a legal multidimensional retiming for G, k the maximum length of G, and i the level number assigned to each node v ∈ V. The retiming of each node v, or j × r where 1 ≤ j ≤ k - 1. By Corollary 4.3, the retimed MDFG is realizable, and since there are no zero-delay edges, $G_r$ is fully parallel. Translating $G_r$ into a circuit design produces $C_r$, which is also fully parallel. The complexity is easily measured by examining the complexity of each step of the algorithm: steps 1, 4, 6, and 7 have complexity O(|E|), since they are equivalent to adding or removing each edge from the graph.
Step 2 
for the cyclic MDFG.
□
An important characteristic of this algorithm is that, for 2-D applications, it guarantees the insertion of either single registers or a fixed number of registers (fixed-size queues) into the original direct paths, instead of large queue elements dependent on the problem size. We describe this characteristic of the fully parallel retimed circuit in the following property of the retiming function for 2-D problems.
Property 4.2: Given a circuit design C that is represented by an MDFG G = (V, E, d, t), the minimum retiming r for G, and s the schedule vector for the retimed graph $G_r = (V, E, d_r, t)$, all zero-delay edges u → v of G, i.e., direct paths in C, will have, after retiming, fixed-size queues of size at most f, where (f × r)(u) is the retiming function applied to u in the algorithm MDR.
Proof: Consider an edge $u \xrightarrow{e} v$ with $d(e) = (0, \ldots, 0)$. After retiming, $d_r(e) = d(e) + r(u) - r(v) = r(u) - r(v)$. According to the topological sort from algorithm MDR, u will be assigned to a level $k_1$ and v to $k_2$, such that $k_1 < k_2$. Let us assume that the longest zero delay path, i.e., the highest level assigned to some node by the topological sort is k, are
The fully parallel solution requires the simultaneous execution of nodes belonging to different iterations of the original problem. It is clear that, if the problem size is smaller than the highest level assigned by the topological sort to the nodes of the MDFG, there will not be enough iterations to be executed in parallel. This situation usually results in some improvement in the execution time but does not achieve the desired fully parallel execution. However, in practical situations, the problem size is expected to be orders of magnitude larger than the number of operations in the circuit, allowing the total utilization of the parallel functional units. The next section illustrates the use of our algorithm in a filter design benchmark found in [7], [20], [29], [30].
to a queue of fixed size in the range 1 to f .
V. EXPERIMENTS
In this section we show the details of the application of our method to the 2-D filter design presented by Gnanasekaran [7]. We also show the results of implementing the examples discussed in this paper, through the use of Mentor Graphics tools [14]-[16].
A. Iteration Time Improvement
We begin by discussing the cycle time improvement obtained in the example of the 2-D filter design found in [7].
This example consists of a filter represented by the transfer function
The original circuit design is shown in Fig. 11. We represent it by its equivalent MDFG shown in Fig. 12. Only the nodes belonging to the IIR section of the filter are labeled, since the solution for the FIR section is considered trivial. Assume both adders and multipliers take one time unit to compute. Using the multidimensional retiming algorithm MDR, we begin by translating the IIR section of the circuit into the MDFG seen in Fig. 12(b). In the next step, we find a possible schedule vector. We recall that the use of a nonorthogonal schedule vector requires the input data to be available in an equivalent ordering or, if the data comes from a single line (or column) scanner, that the data be stored in such a way as to allow the use of the new schedule. In the latter case, such a storage may be equivalent to the problem size. The selected schedule for our experiment is s = (1,1). We proceed by choosing the minimum multidimensional retiming function according to the third step in our algorithm, obtaining r = (-1,1). Then we perform the topological ordering of the DAG associated with the IIR section of the graph, as shown in Fig. 13. The node labeled A6 is assigned to level 0, node A8 to level 1, M6 to level 2, A4, M2, M3, M5, and M7 to level 3, and finally, A1, A2, A3, A5, A7, M1, M4, and M8 to level 4. Therefore the maximum length of any path in the IIR section is 4. In the next step, we apply the multidimensional retiming function to each node that has at least one zero-delay outgoing edge. The resulting retiming functions are: r(A6) = (-4,4), r(A8) = (-3,3), r(M6) = (-2,2), and r(A4) = r(M2) = r(M3) = r(M5) = r(M7) = (-1,1).
The FIR section of the filter has a trivial solution, since there is no cyclic dependence (nonrecursive graph). Using known retiming properties, we insert as many delays as necessary into the FIR section. We notice that, in our case, the critical path is reduced to one functional element. Therefore, we are able to produce a new design with a cycle time equal to the execution time of one multiplication (assuming that the multiplication has a longer execution time than one addition). This represents a gain equivalent to the time of one adder over the previously existing results. Since no cycle time can be shorter than the longest atomic operation, the final result is considered optimal, and the full parallelism produced by our algorithm guarantees an optimal cycle time.
B. Implementation Characteristics
We simulated the implementation of the examples presented in this paper utilizing CMOS technology and standard cell techniques. The original storage elements were assumed to be implemented in an external memory unit. This allowed our simulation to consider only the additional area required by the new delays being placed in the circuit. Fig. 16 shows the rearrangement applied to the original storage elements in Fig. 11 to isolate those elements from the remainder of the circuit. The shaded area represents the external memory unit. To justify this new arrangement, we recall that those storage elements, if considered part of the circuit, would imply a predefined schedule vector, and the problem would be constrained by the one-dimensional retiming limitations. Also, for an original row-wise execution of a problem of 1000 x 1000 points, those queues would be 3 orders of magnitude larger than the new elements being inserted in the circuit by our algorithm. Such a difference could make the variations in circuit area look small, hiding the most important results. Moreover, if the schedule vector did not change, existing queues would be reduced proportionally to the number of delays placed inside the circuit, which would also reduce the impact of the multidimensional retiming changes. Therefore, we can conclude that the simulations were done considering the worst case for area evaluation. For the other examples, a similar rearrangement was not necessary since the distribution of the queues was already in the desired format.
The standard devices used in the implementation simulations are presented in Table I.
As mentioned before, these devices were chosen from a CMOS standard library, and the number of transistors reflects their actual CMOS implementation model. The propagation delay is equivalent to the execution time of each operation, while the area cost is a characteristic of such models, proportional to the layout area required by each cell. This area cost is available to the software optimization tools for automatic compaction of the final layout.
Fig. 17. Filter optimized by Gnanasekaran.
The impact of the multidimensional retiming on the circuit area can be obtained from Table II, which summarizes the results of our simulated implementations.
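The "3 orders of magnitude" figure can be illustrated with a small sketch. It assumes a row-wise scan in which the first component of a delay vector counts rows and the second counts points within a row; this convention and the function name `queue_length` are assumptions made for illustration only.

```python
from typing import Tuple


def queue_length(delay: Tuple[int, int], row_length: int) -> int:
    """Number of storage cells needed to hold a 2-D delay under a
    row-wise scan (ASSUMED convention: delay = (rows back, points back))."""
    rows_back, points_back = delay
    return rows_back * row_length + points_back


ROW_LENGTH = 1000  # one row of the 1000 x 1000 problem mentioned in the text

original_queue = queue_length((1, 0), ROW_LENGTH)  # dependence on the previous row
single_register = 1                                # one register inserted by retiming

print(original_queue // single_register)  # 1000
```

A one-row delay therefore costs a queue of 1000 cells under row-wise execution, against a single register per edge for the delays inserted by the retiming.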
This table shows the simulation results for the total layout area required by the original design and by the retimed circuit. The area values are measured in λ², where the value of λ depends on the manufacturing technology. The value used for λ in our simulations was 1. micron, but there are lower values available in the industry. The table also shows the number of transistors and the cycle time, i.e., the critical path execution time, for each case. The differences were computed as positive numbers when they imply an increase in either the area or the number of transistors. Negative numbers are used to show the cycle time reduction, reflecting the improvement in circuit performance.
The example named 2AM in the table is the one shown in Section I, Fig. 1, with two adders and two multipliers. Its time improvement, represented by a negative percent value, is equivalent to the reduction of the original execution time from a series of two adders and one multiplier to one multiplier. However, the total additional area required is about 6% of the original circuit. The IIR case is the IIR section shown in Fig. 11. It contains 8 multipliers and 8 adders, and its original execution time was equivalent to one multiplier followed by 4 adders. After the rearrangement of the queues, it became equivalent to one multiplier followed by 5 adders. The value recorded in the table, however, shows the original time equivalent to four adders and one multiplier, so the computation of the difference is not affected by our intentional redistribution of the queues. The multidimensional retiming of this problem results in a new delay on each of the edges of the graph, i.e., one delay on each data path of the circuit. The final performance is about 3 times faster than the original circuit. The additional layout area amounted to 4.7% of the original area, which can be considered negligible when compared to the performance optimization obtained. Finally, the FIR section of our example was implemented, resulting in similar observations.
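The cycle-time figures for the 2AM and IIR examples can be written out symbolically. The symbols t_A and t_M (execution times of one adder and one multiplier) and the percent-difference formula are notation assumed here for illustration; the actual device delays are those of Table I.

```latex
\begin{align*}
\Delta_T &= 100 \cdot \frac{T_{\mathrm{new}} - T_{\mathrm{old}}}{T_{\mathrm{old}}}
  && \text{(negative when the cycle time decreases)} \\
\text{2AM:}\quad T_{\mathrm{old}} &= 2t_A + t_M, \quad T_{\mathrm{new}} = t_M
  && \Rightarrow\ \Delta_T = -100 \cdot \frac{2t_A}{2t_A + t_M} \\
\text{IIR:}\quad T_{\mathrm{old}} &= 4t_A + t_M, \quad T_{\mathrm{new}} = t_M
  && \Rightarrow\ \frac{T_{\mathrm{old}}}{T_{\mathrm{new}}} = 1 + \frac{4t_A}{t_M}
\end{align*}
```

Under this model, a speedup of about 3 for the IIR section would correspond to an adder delay of roughly half the multiplier delay, since 1 + 4t_A/t_M = 3 gives t_A = t_M/2.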
The example named GNA compares the simulated implementation of the optimized solution presented in [7] (Fig. 17) with the solution obtained by our method. As with the other examples, the original queues of length greater than one were translated into external memory elements and not considered in the implementation. As we can see in the table, the time improvement of one adder resulted in a 29% increase, for a reduction of 1% in the layout area, possibly due to the inconsistency of the layout tool. After evaluating these numbers, we can observe that the additional area required in the implementations is worth the performance improvement obtained.
VI. CONCLUSION
We have presented a novel technique for optimizing a circuit designed to compute multidimensional applications. The circuit was modeled as a multidimensional data flow graph and transformed through the use of a multidimensional retiming method that we developed. We have demonstrated that using a sequence of execution associated with hyperplanes, as in a wavefront approach, allows us to insert single registers in the circuit paths to optimize its cycle time. Common mistakes that could produce results requiring new large queues are avoided through the use of our algorithm. This technique is able to achieve full parallelism, i.e., simultaneous execution of all operations (nodes) of the circuit (multidimensional data flow graph). A preselected multidimensional retiming function is the core of the algorithm. The process of selecting such a function was presented. Unlike the one-dimensional case, where the retiming technique cannot guarantee full parallelism, the main theorem of this paper shows that a multidimensional problem can always achieve full parallelism efficiently, in O(|E|) time.
The algorithm was presented in detail and proved to be correct. A comparative example on the iteration time improvement based on [7] has been shown. The resulting cycle time for one iteration obtained for this example is equivalent to the longest operation in the circuit, i.e., one multiplication, which is shorter than the previously obtained cycle time of one addition plus one multiplication. Simulated implementations of some of the examples have shown that the additional layout area required by the additional delays is small when compared to the performance improvement. These results demonstrate the effectiveness of this new method and its great potential in circuit optimization.
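The full-parallelism property can be checked in a single pass over the edges, which is where a cost linear in |E| comes from. The sketch below assumes the usual retiming rule d_r(e) = d(e) + r(u) - r(v) for an edge e from u to v; the three-node graph, its delays, and all function names are hypothetical and are not taken from the examples of this paper.

```python
from typing import Dict, List, Tuple

Vector = Tuple[int, int]
Edge = Tuple[str, str, Vector]  # (source node, target node, delay vector)


def retime_edges(edges: List[Edge], r: Dict[str, Vector]) -> List[Edge]:
    """Apply the ASSUMED rule d_r(e) = d(e) + r(u) - r(v) to every edge."""
    return [
        (u, v, (d[0] + r[u][0] - r[v][0], d[1] + r[u][1] - r[v][1]))
        for u, v, d in edges
    ]


def is_fully_parallel(edges: List[Edge]) -> bool:
    """True when no edge has a zero delay, so that all nodes of one
    iteration can execute simultaneously. One pass over the edges."""
    return all(d != (0, 0) for _, _, d in edges)


# Hypothetical MDFG: a zero-delay chain A -> B -> C plus two feedback edges.
EDGES: List[Edge] = [
    ("A", "B", (0, 0)),
    ("B", "C", (0, 0)),
    ("C", "A", (1, 0)),
    ("B", "A", (0, 1)),
]

# Level-based retiming with base vector (-1, 1): A at level 0, B at 1, C at 2.
RETIMING: Dict[str, Vector] = {"A": (-2, 2), "B": (-1, 1), "C": (0, 0)}

retimed = retime_edges(EDGES, RETIMING)
print(is_fully_parallel(EDGES))    # False: the chain has zero-delay edges
print(is_fully_parallel(retimed))  # True: every edge now carries a delay
```

In this toy graph the zero-delay edges become (-1, 1) and the feedback edges become (3, -2) and (1, 0), so no edge is left without a register.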
