Abstract
Introduction
Computation-intensive applications usually depend on time-critical sections, consisting of loops of instructions, also called iterations. The design of application-specific solutions for such sections improves the computing performance. A multi-dimensional (MD) retiming transformation is presented in this study to achieve an additional improvement, while considering the consequences of such transformations on the memory requirements.
Retiming was initially proposed by Leiserson and Saxe [4], focusing on 1-D problems. Such a technique presents a lower bound on the achievable cycle time due to the number of delays existing in a cycle. Most of the research in this area has followed this approach and is consequently subject to the same constraint [2]. Multi-dimensional systems have been covered in studies using linear programming [...]
This work was supported in part by ORAU Faculty Enhancement Award under Grant [...]
[...] to model the placement of registers along the circuit data paths, while considering the consequences in the memory structure. This new method is applicable to any problem involving more than one dimension.
We use the notation $z_i^{-1}$ to indicate a delay element in the i direction. If we assume that the computation follows a row-wise sequence and that the total number of points in the row direction is M, then the 2-D delay (1,1) can be represented by a FIFO structure of size M + 1, i.e., a serial implementation of $z_1^{-1}$ and $z_2^{-1}$ elements. The $z_1^{-1}$ element representing the 2-D delay (1,0) is equivalent to only one delay. The current cycle time for this design is equivalent to the sequential execution of two additions and one multiplication. In this case, the initial assumption on the computation sequence strongly affects how much the design can be optimized, since 1-D retiming cannot reduce the initial cycle time. By using our algorithm, we apply a 2-D retiming (-2,4) to node D and (-1,2) to node A, resulting in the fully parallel solution shown in Figure 2, with a cycle time equivalent to one multiplication.
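The FIFO sizes quoted above for row-wise execution can be reproduced with a short calculation. The following is a minimal sketch, assuming that a 2-D delay (dx, dy) means dx steps along the row and dy whole rows of M points each; the function name `fifo_size` and the generalization beyond the two delays discussed in the text are illustrative assumptions, not part of the original method.

```python
def fifo_size(delay, row_length):
    """Number of registers for a 2-D delay under row-wise execution.

    Assumed convention (not stated in this form in the paper):
    delay = (dx, dy), where dx counts points along the row and
    dy counts whole rows of `row_length` points.
    """
    dx, dy = delay
    size = dx + dy * row_length
    if size <= 0:
        # A non-positive distance would refer to a value not yet computed.
        raise ValueError("delay does not point to an earlier iteration")
    return size


M = 8  # hypothetical number of points in the row direction
print(fifo_size((1, 1), M))  # FIFO of size M + 1
print(fifo_size((1, 0), M))  # a single delay element
```

With this convention, the delay (1,1) costs M + 1 registers and the delay (1,0) costs one, matching the two cases described in the text.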
We have now two cases: [...]
Here we introduce an important corollary from Theorem 3.1. Corollary 3.2 explores the capability of using multiple values of the retiming function. [...]
The final schedule vector plays an important role in the MD retiming process. In this section, we discuss the relationship between the schedule vector and the storage elements in the circuit design.
In a two-dimensional (2-D) problem, a row-wise execution is equivalent to a schedule vector (0,1), and it implies that delays (1,0) must be translated into single-register delays. Figure 3(a) shows a sequence of execution imposed by a row-wise computation in a two-dimensional problem. We notice the progress in the y-direction, as indicated by the schedule vector (0,1), and a faster recursion in the x-direction.
Non-orthogonal schedule vectors define execution hyperplanes that require a more complex formulation of the queue sizes. Because these hyperplanes are not parallel to the orthogonal axes x or y, the number of points varies according to the boundaries of the iteration space and the slope of the hyperplane. The following lemma shows how to compute the maximum number of integral points in an execution hyperplane for a two-dimensional iteration space with an equal number of points in both directions: [...]
[...] vectors presented earlier, the size of the queue for those delay vectors would be reduced to the term Ph(d), which represents a fixed-size queue not dependent on the problem dimensions. Figure 3(b) shows an iteration space and the execution sequence for a schedule vector (1,1). Delay vectors with value (-1,1) represent a dependence within a hyperplane, and are translated into one register at the circuit level.
In order to optimize the memory design, we try to select a schedule vector s such that the term s · d is small or eventually zero. Therefore, we redefine the constraints in the selection of the schedule vector by allowing s · d = 0. Intuitively, we know that the non-zero delay vectors whose products with s could result in zero must be the outermost dependencies in the span of non-zero delay vectors. However, we also know that a first translation of dependencies into registers can introduce the constraints found in 1-D retiming. Therefore, to avoid such constraints, we introduce a new selection criterion for s.
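The role of the product s · d, and the way the number of points per hyperplane depends on the schedule vector, can be illustrated numerically. The sketch below is not the lemma from the text: it counts points by brute-force enumeration over a hypothetical N × N iteration space, and the helper names `dot`, `classify` and `max_points_on_hyperplane` are assumptions introduced only for illustration.

```python
def dot(s, d):
    """Inner product of a schedule vector s and a 2-D vector d."""
    return s[0] * d[0] + s[1] * d[1]


def classify(s, d):
    """Describe a delay vector d relative to the hyperplanes defined by s."""
    p = dot(s, d)
    if p < 0:
        return "violates the schedule"
    if p == 0:
        return "dependence within a hyperplane"
    return f"crosses {p} hyperplane(s)"


def max_points_on_hyperplane(s, n):
    """Brute-force maximum number of integral points sharing one value of
    s . p in an n x n iteration space (enumeration, not a closed form)."""
    counts = {}
    for x in range(n):
        for y in range(n):
            c = dot(s, (x, y))
            counts[c] = counts.get(c, 0) + 1
    return max(counts.values())


print(classify((1, 1), (-1, 1)))  # the case discussed for Figure 3(b)
print(classify((0, 1), (1, 0)))   # row-wise execution, delay (1,0)
print(max_points_on_hyperplane((0, 1), 4))
print(max_points_on_hyperplane((1, 1), 4))
```

Both delay vectors named in the text give s · d = 0 for their respective schedule vectors, which is the situation in which the delay maps to a queue whose size does not grow with the problem dimensions.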
We assume, without loss of generality, that all nodes represent unit-time operations, and define such a criterion in the following theorem. [...] After choosing the schedule vector, we need to keep the retimed vectors equivalent to minimum-size queues. We choose a retiming function in such a way as to produce minimum delays, as defined below: [...]
The Algorithm
In a first approach for transforming a 2DFG and, consequently, its equivalent circuit into a fully parallel MDFG, our algorithm verifies whether there is a negative cycle in the iteration space whenever the retiming vector used is parallel to any of the outermost delay vectors. We call this procedure, shown below, ParallelRetim. It is the first section of the main algorithm described later.
ParallelRetim(G)
/* compute the outermost dependence vectors */
/* modified single-source shortest path follows */
[...]
      if v ∉ Queue then PUT(v, Queue)
      if u = tail then count++, tail ← LAST(Queue)
   }
   found ← (Queue = ∅)
}
return (found)
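The surviving fragment of the procedure (a queue, a tail marker, a pass counter, and a result that is true when the queue empties) resembles a queue-based Bellman-Ford pass with negative-cycle detection. The following is a minimal sketch of that general technique only. It assumes hypothetical scalar edge weights and integer node labels, and it does not reproduce how ParallelRetim derives its weights from the outermost delay vectors.

```python
from collections import deque


def relaxation_settles(num_nodes, edges, source=0):
    """Queue-based Bellman-Ford with pass counting.

    edges: list of (u, v, weight) with hypothetical scalar weights.
    Returns True when the queue empties within num_nodes passes, i.e.
    when no negative cycle is reachable from `source`.
    """
    adjacency = [[] for _ in range(num_nodes)]
    for u, v, w in edges:
        adjacency[u].append((v, w))

    dist = [float("inf")] * num_nodes
    dist[source] = 0
    queue = deque([source])
    in_queue = [False] * num_nodes
    in_queue[source] = True
    tail = source  # last node belonging to the current pass
    count = 0      # number of completed passes

    while queue and count < num_nodes:
        u = queue.popleft()
        in_queue[u] = False
        for v, w in adjacency[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                if not in_queue[v]:
                    queue.append(v)
                    in_queue[v] = True
        if u == tail:
            # The current pass is finished; the next one ends at the
            # node that is last in the queue right now.
            count += 1
            if queue:
                tail = queue[-1]

    return not queue


print(relaxation_settles(3, [(0, 1, 1), (1, 2, -1), (2, 0, 1)]))
print(relaxation_settles(3, [(0, 1, 1), (1, 2, -3), (2, 0, 1)]))
```

Bounding the number of passes by the number of nodes is what guarantees termination: without a negative cycle, every shortest path has fewer edges than there are nodes, so the queue must empty before the bound is reached.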
If the chosen retiming is not valid, a new retiming vector, not parallel to any delay vector, is chosen. We call the main algorithm Comdr, for Circuit Optimization via MultiDimensional Retiming. The mathematical description of this algorithm is presented below: [...]
In the general case, the desired cycle time may be large enough to accommodate the sequential execution of two or more operations. In such a situation, a slight modification of the algorithms, combining those nodes that would fit in the target cycle time when executed serially, produces a circuit (not fully parallel) with a lower number of non-zero delay edges that still satisfies the specified cycle time. Such a solution is not presented in this paper due to space constraints.
Example
In this section we present the application of our method to a 2-D filter design with transfer function [...] Therefore, the selected MD retiming function is r = (0,1).
The algorithm produces the following coefficients for the retiming function: A5 is assigned to -4, node A4 [...] and the original cycle time, we notice that the critical path was reduced from four adders and one multiplier to one functional element, with a cycle time equal to the execution time of one multiplication. Under the unit-time assumption, this represents a gain of 5 times in computational time. Another important result in this example is that all new delays have size one for the schedule vector s = (1,0). The full parallelism produced by our algorithm guarantees optimal results with respect to the cycle time of the circuit, since no cycle time can be shorter than the execution time of a single functional element.
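The reduction of the critical path from five unit-time operations to one can be illustrated on a small graph. The sketch below does not reconstruct the filter of this example: the five-node cycle, its single delay (1,1), and the coefficients are hypothetical. It also assumes the common retiming convention d_r(e) = d(e) + r(u) - r(v) for an edge from u to v, with r(v) a multiple of the base retiming vector (0,1).

```python
def retime(edges, coeff, base):
    """Retime every edge with d_r(e) = d(e) + r(u) - r(v), where
    r(v) = coeff[v] * base (assumed convention)."""
    result = []
    for u, v, (dx, dy) in edges:
        k = coeff[u] - coeff[v]
        result.append((u, v, (dx + k * base[0], dy + k * base[1])))
    return result


def cycle_time(nodes, edges):
    """Longest chain of unit-time nodes linked by zero-delay edges.
    Assumes the zero-delay subgraph is acyclic."""
    successors = {n: [] for n in nodes}
    for u, v, d in edges:
        if d == (0, 0):
            successors[u].append(v)
    memo = {}

    def longest(n):
        if n not in memo:
            memo[n] = 1 + max((longest(m) for m in successors[n]), default=0)
        return memo[n]

    return max(longest(n) for n in nodes)


# Hypothetical graph: four adders and one multiplier in a single cycle.
nodes = ["A1", "A2", "A3", "A4", "M"]
edges = [
    ("A1", "A2", (0, 0)),
    ("A2", "A3", (0, 0)),
    ("A3", "A4", (0, 0)),
    ("A4", "M", (0, 0)),
    ("M", "A1", (1, 1)),
]
coeff = {"A1": 4, "A2": 3, "A3": 2, "A4": 1, "M": 0}  # hypothetical values
retimed = retime(edges, coeff, (0, 1))

print(cycle_time(nodes, edges))    # critical path before retiming
print(cycle_time(nodes, retimed))  # critical path after retiming
print([d for _, _, d in retimed])  # retimed delay vectors
```

In this hypothetical graph the sum of the delay vectors around the cycle is unchanged by the retiming, and every retimed delay d satisfies s · d ≥ 0 for s = (1,0), so the retimed graph remains schedulable under the relaxed criterion discussed earlier.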
