In this paper, we propose a systematic pipelining method for a linear system that minimizes power and maximizes throughput, given a constraint on the number of pipeline stages and a set of resource constraints. The method first retimes operations so that as many operations as possible take common operands as their inputs, and then performs operand sharing based on list scheduling. Experimental results show that the proposed approach reduces the power consumption of the functional units by more than 20% in the best case, compared to state-of-the-art pipelining and operand sharing techniques.
INTRODUCTION
Research on pipelining of linear systems has been carried out in various areas. Two representative areas are code generation in compilers for embedded VLIW processors and high-level synthesis for ASIC design. Previous work on pipelining of a linear system primarily targets throughput maximization under a resource constraint [4], latency minimization under resource and throughput constraints [3], and joint throughput and latency optimization under a resource constraint [5]. However, in the design of embedded real-time linear systems, which are gaining more and more attention, we often need to minimize the power consumption while considering the throughput and latency constraints, a problem that previous work has not considered.

In this paper, we propose a systematic pipelining method for a linear system that minimizes power first and then maximizes throughput, given a constraint on the number of pipeline stages (PSs) and a set of resource constraints. Unlike most existing pipelining approaches, our method takes the number of PSs as one of the constraints and views pipelining from the standpoint of power minimization. The number of PSs is related to code size in code generation, whereas it is related to controller and register overhead in high-level synthesis. Therefore, the ability to handle the number of PSs as a constraint is important in both areas. Given the number of PSs as a constraint, maximizing the throughput in the proposed approach corresponds to minimizing the latency. Note that the system latency can be computed as the number of PSs divided by the system throughput. In this paper, we focus on applying the proposed pipelining technique to high-level synthesis. The technique, however, can also be applied to code generation for an embedded VLIW processor, mutatis mutandis.

Given a DFG (Data Flow Graph) as an input, the operand sharing technique tries to bind operations with a CO (Common Operand) to the same FU (Functional Unit) so that the input activity of the shared FU decreases. There have been a few research results on power reduction using the operand sharing technique [1] [2]. Their limitation is that the DFG given as an input has only a small number of operation nodes whose inputs are common, resulting in little chance of operand sharing and insignificant power reduction. A technique has been proposed to overcome this limitation by generating as many operation nodes with COs as possible through loop pipelining [6]. However, that technique does not consider constraints on the number of PSs and on throughput. In addition, it does not deal with applications with feedback edges, that is, loop-carried dependencies.

This paper proposes systematic pipelining to solve the problems mentioned above. Our approach extends [6] in its basic concept: it generates operation nodes with COs, which are invisible in the original DFG, via pipelining. However, it is new in that it proposes a novel pipelining algorithm that deals efficiently with feedback edges in a DFG and considers the number of PSs and throughput when generating nodes with COs.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED '01, August 6-7, 2001, Huntington Beach, California, USA.
PRELIMINARIES
Data Flow Graph
We use a DFG as a model that represents a simple loop body of a linear system. Figure 2(c) shows the DFG of a simple second-order IIR filter. The number beside each edge is its weight, which denotes the iteration difference between the source node (data producer) and the target node (data consumer) of the edge and is written $w(e_k)$, $\forall e_k \in E$.
Operand Sharing
Figure 1 illustrates the binding alternatives after scheduling. Assume that we allocate two multipliers, mult0 and mult1, for synthesis. The binding in Figure 1(b) maximizes the temporal correlation of the input signal sequence, since the input operands fed to an input of mult0 are fixed within the same iteration and even preserve the original correlation well across consecutive iterations. This decreases the switched capacitance of mult0 and therefore its power consumption. It is reported that the typical value of P1/P2, computed through switch-level simulation, is about 0.65 for a 12-bit multiplier [1], where P1 and P2 are respectively the average power consumptions of the multiplier when only one operand changes and when both operands change simultaneously.
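The effect of binding on multiplier power can be illustrated with a rough sketch that uses only the reported P1/P2 ratio. The function name, the operand labels, and the choice to charge nothing when neither operand changes are assumptions made for illustration; operand names stand in for operand values, and the result is a relative figure rather than a power estimate.

```python
P1_OVER_P2 = 0.65  # ratio reported in the text for a 12-bit multiplier [1]


def relative_multiplier_energy(operand_pairs, p1_over_p2=P1_OVER_P2):
    """Sum a relative energy over consecutive executions on one multiplier.

    Costs are normalised so that an execution in which both operands change
    costs 1.0 and one in which a single operand changes costs p1_over_p2.
    An execution in which neither operand changes is charged 0.0, which is
    an assumption of this sketch, not a figure from the paper.
    """
    energy = 0.0
    for (prev_a, prev_b), (a, b) in zip(operand_pairs, operand_pairs[1:]):
        changed = (a != prev_a) + (b != prev_b)
        if changed == 2:
            energy += 1.0
        elif changed == 1:
            energy += p1_over_p2
    return energy


# Hypothetical operand sequences executed back to back on one multiplier.
unshared = [("c1", "x"), ("c2", "y"), ("c3", "x"), ("c4", "y")]
shared = [("c1", "x"), ("c3", "x"), ("c2", "y"), ("c4", "y")]

print(relative_multiplier_energy(unshared))
print(relative_multiplier_energy(shared))
```

Reordering the same four multiplications so that operations with a common operand run consecutively lowers the relative figure, which is the effect that operand sharing exploits.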
Retiming
Retiming is a transformation that increases the throughput of a loop or improves the utilization of resources by introducing partial overlap between the execution times of successive loop iterations in the original description, that is, by pipelining several loop body iterations. The retiming function $r(n)$, $\forall n \in N$, is the number of delays drawn from each of the incoming edges of node $n$ and pushed to each of its outgoing edges. By changing the position of delays through retiming, the weight $w(e_k)$, $\forall e_k \in E$, of each edge of the original DFG is transformed to a new weight $w_r(e_k) = w(e_k) + r(n_i) - r(n_j)$, $\forall e_k \in E$, where $e_k$ is an edge from node $n_i$ to node $n_j$. A retiming is legal if $w_r(e_k)$ is nonnegative for all $e_k \in E$ [12]. Figure 2 shows a loop pipelining example using the retiming technique on a simple second-order IIR filter. Assume that one multiplier is shared by the multiplication operations in the loop body. As shown in Figure 2(d) and 2(f), moving one delay on the incoming edge of the upper left multiplication node to its outgoing edge corresponds to pipelining nodes from two iterations, that is, the upper left multiplication node from the (i+1)th iteration and the remaining nodes from the (i)th iteration. As a result of loop pipelining, the critical path length is reduced by the elimination of the intra-iteration dependency from the upper left multiplication node to an addition node. Therefore, the initiation interval of the loop is reduced from 3 to 2 time steps. Note, however, that such an improvement in throughput is achieved at the cost of controller and register overhead caused by the increased number of PSs. We also observe that, after pipelining, the single allocated multiplier performs the two multiplications with one common operand within one iteration, as shown in Figure 2(d) and 2(f), and that the pipelining reduces power consumption through the reduced switching activity of the input signals to the multiplier. This is an important motivation of our approach, which will be described in detail in the following section.
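The retiming rule and its legality condition can be sketched in a few lines. The graph below is a hypothetical three-node loop, not the filter of Figure 2, and the function names are illustrative. The sign convention follows the definition above: r(n) delays are taken from the incoming edges of n and added to its outgoing edges.

```python
def retime(edges, r):
    """Return retimed weights w_r(e) = w(e) + r(source) - r(target).

    edges: dict mapping (source, target) -> weight (number of delays).
    r: dict mapping node -> number of delays moved from its incoming
       edges to its outgoing edges; nodes not listed default to 0.
    """
    return {
        (src, dst): w + r.get(src, 0) - r.get(dst, 0)
        for (src, dst), w in edges.items()
    }


def is_legal(retimed_edges):
    """A retiming is legal if no edge ends up with a negative weight."""
    return all(w >= 0 for w in retimed_edges.values())


# Hypothetical loop: in -> mul -> add -> in, with two delays on the cycle.
edges = {("in", "mul"): 1, ("mul", "add"): 0, ("add", "in"): 1}

moved = retime(edges, {"mul": 1})     # push one delay through "mul"
too_far = retime(edges, {"mul": 2})   # tries to take a delay that is absent

print(moved, is_legal(moved))
print(too_far, is_legal(too_far))
```

The total weight around the cycle is unchanged by the legal retiming, which is why retiming preserves the behavior of the loop while changing which iterations overlap.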
MOTIVATION
To alleviate the limitation on the number of nodes with COs in the original DFGs, we present a novel loop pipelining method. While existing loop pipelining transformations try to maximize the throughput or minimize the latency of a loop, our pipelining algorithm retimes operation nodes so that as many nodes as possible take COs as their inputs and performs scheduling and binding based on the operand sharing concept. It therefore reduces the switching activity of FUs, especially of multipliers, while still trying to maximize the throughput of the loop. This transformation has a significant power-reducing effect on linear applications such as filters. We illustrate in detail the motivation briefly mentioned at the end of the previous section, using the simple second-order IIR filter in Figure 2. The switching activity of the multiplier is determined by the change of values of the two input operands between consecutive executions. As shown in Figure 2(a), 2(c), and 2(e), without any transformation, both input operands change their values twice per iteration. Now consider pipelining two consecutive iterations as in Figure 2(b), 2(d), and 2(f). Then one input (s[n-1]) of the multiplier changes its value only once per iteration, and we can save some of the power consumed by the multiplier through the switching activity reduction. After pipelining, one operand of each of the two multiplication operations becomes common to both. As shown in Figure 2(f), we can directly apply operand sharing to the fanout edges of a node whenever the edges have the same weight and the target nodes of those edges are bound to the same FU. Figure 3 shows the DFG for a typical second-order IIR filter. This filter contains only one compatible edge set, which is shown in Figure 4(a). It consists of four edges whose target multiplication nodes, *a, *b, *c, and *d, take the value from an addition node, +g, as their inputs after one or two iterations.
Assume that the DFG in Figure 3 is originally non-pipelined, so that initially all nodes in the DFG are put in one PS. We also assume that one multiplier is allocated for the multiplication operations. If appropriate scheduling is performed so that the four multiplications are executed in the order *a, *c, *d, *b by the multiplier, one input value of the multiplier changes three times within each iteration and across successive iterations. If two PSs are used so that the *b, *c, +g, +f, and +h nodes are executed in the second stage, all edges in the compatible edge set have the same weight, as shown in Figure 4(b). Now one input value of the multiplier does not change within each iteration and changes once across successive iterations, irrespective of the scheduling, which contributes to the power reduction of the allocated multiplier. The pipelining in Figure 3 can be regarded as a retiming that moves one delay (weight) on the incoming edges of *a and *d to their fanout edges. The relation between loop pipelining and retiming to generate COs is explained in detail in the next section through a novel force-directed retiming, which is the first phase of our pipelining algorithm. When performing loop pipelining for CO generation, we must consider constraints such as throughput, latency, and available resources. In this paper we consider throughput under constraints on resources and the number of PSs, although the primary concern is power. Figure 5 shows the overall process of our systematic pipelining to solve the problem. Our approach is composed of three phases in a main loop, following the preprocessing step. In the first phase of our pipelining method, we propose a novel technique called force-directed retiming, which retimes operation nodes to allow more nodes to take COs.
In the second phase, we perform operand sharing on the retimed DFG, based on list scheduling, and in the last phase we check whether the scheduled result satisfies the given constraints. If the result satisfies all the constraints, it is called feasible and is regarded as one of the solutions. When the result is not feasible, we try to make it feasible by performing additional incremental folding and rescheduling on the nodes other than those made to take COs by the force-directed retiming. To explain the details of each phase of the algorithm, including preprocessing, we take the example of the second-order IIR filter shown in Figure 3 and assume that three adders and two multipliers are allocated with three PSs. We also assume, for the time being, that the execution times of the adders and multipliers are one time unit, but our algorithm can handle multi-cycle and pipelined FUs.
SYSTEMATIC PIPELINING FOR LOW
Preprocessing
Lower bound on initiation interval (line 1): The lower bound on II, as already presented in many papers [3] [4] [5], is determined by the resource constraint and the lengths of cycles in the DFG. In the case of the IIR filter, the lower bound is 2. The pipelining algorithm initially takes the lower bound as one of the constraints, together with the number of PSs and the resource constraint, and tries to find solutions. If there is no solution, II is increased by one (lines 15 and 16).
Compatible edge sets and a PPS set (lines 2, 4, 5, and 6): The algorithm finds compatible edge sets (line 2). As mentioned in Section 3, target nodes of all edges in a compatible edge set can take a CO as one of their inputs through retiming. However, to determine whether a target node can take a CO or not, we need to fix the source node to a specific PS. First, we determine the range of PSs and time steps, [(PS, TS)_S, (PS, TS)_L], called the PS&TS frame, for each operation node (line 4). The PS&TS frames are computed by performing pipelined ASAP and ALAP scheduling under the II and PS constraints. We use the algorithm proposed in [5] for the pipelined ASAP and a slightly modified version of it for the pipelined ALAP. In the IIR filter example, the PS&TS frames for the *a, *b, *c, *d, and +e nodes are [(0, -), (1, -)] and those for the +f, +g, and +h nodes are [(1, -), (2, -)], where - denotes a number from one to II. Then, based on the PS&TS frames, we generate a set of vectors called the PPS set (line 5). Each vector in the set represents a combination of PSs where the source nodes in compatible edge sets are positioned. Assuming n compatible edge sets, the vector is in the form [PS(1), PS(2), ..., PS(n)], where PS(i) denotes the PS where the source node of the i-th compatible edge set is positioned.
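One way to picture the PPS set is as the Cartesian product of the PS ranges of the source nodes. The sketch below assumes each source node's PS range is given as an (earliest, latest) pair; the function name and the second example are hypothetical and do not come from the paper.

```python
from itertools import product


def pps_set(source_ps_frames):
    """Enumerate PPS vectors [PS(1), ..., PS(n)].

    source_ps_frames: list of (earliest_ps, latest_ps) pairs, one per
    compatible edge set, giving the PS range of that set's source node.
    """
    ranges = [range(lo, hi + 1) for lo, hi in source_ps_frames]
    return [list(vector) for vector in product(*ranges)]


# IIR filter example from the text: one edge set whose source +g may be
# placed in PS 1 or PS 2.
iir_vectors = pps_set([(1, 2)])

# Hypothetical case with two compatible edge sets.
two_set_vectors = pps_set([(1, 2), (0, 1)])

print(iir_vectors)
print(two_set_vectors)
```

Because the main loop runs once per vector, the number of vectors grows multiplicatively with the number of compatible edge sets and the width of each PS range.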
Our algorithm shown in Figure 5 obtains solutions for all the vectors in the PPS set, in other words, for all possible combinations of placing the source nodes of compatible edge sets in their PS ranges (lines 6 to 13), and determines the best among the solutions according to criteria such as power, II, and turn-around time (latency) (line 18). The IIR filter example contains only one compatible edge set (shown in Figure 4(a)), as mentioned briefly in the previous section. Its source node +g has the PS&TS frame [(1, -), (2, -)], so the PPS set has two elements, 1 and 2.
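The lower bound on II used at the start of preprocessing can be sketched with the usual resource and recurrence bounds from the loop pipelining literature. This is an assumed formulation, not necessarily the exact one used in [3] [4] [5]: it treats FUs as non-pipelined, and the cycle data in the example are hypothetical. The operation and FU counts follow the IIR example in the text (four multiplications, four additions, two multipliers, three adders, unit execution times).

```python
from math import ceil


def resource_bound(op_counts, fu_counts, exec_time):
    """Largest ceil(ops * time / units) over all operation types."""
    return max(
        ceil(op_counts[t] * exec_time[t] / fu_counts[t]) for t in op_counts
    )


def recurrence_bound(cycles):
    """cycles: list of (total execution time, total edge weight) per cycle."""
    return max((ceil(time / weight) for time, weight in cycles), default=0)


def ii_lower_bound(op_counts, fu_counts, exec_time, cycles):
    """Lower bound on II from resources and from cycles in the DFG."""
    return max(
        resource_bound(op_counts, fu_counts, exec_time),
        recurrence_bound(cycles),
    )


ops = {"mul": 4, "add": 4}
fus = {"mul": 2, "add": 3}
times = {"mul": 1, "add": 1}
cycles = [(2, 1)]  # hypothetical cycle: two time units, one delay

print(ii_lower_bound(ops, fus, times, cycles))
```

With these inputs the bound is 2, which agrees with the value stated for the IIR filter; a solution at that II is not guaranteed, which is why the algorithm increases II when no feasible schedule is found.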
Force-Directed Retiming
To determine the optimum locations of the target nodes within their PS frames in such a way that more nodes take COs as their input values, we use a force-directed algorithm (line 8 in Figure 5).¹ The procedure Force-directed retiming/clustering is shown in Figure 6.
[Figure 6. Force-directed retiming/clustering.]
Definition 2. CO probability Pij(k) denotes the probability that the target node of edge i in compatible edge set j is placed in a PS such that edge i gets weight k.
Definition 3. CO type distribution Q(k) denotes the sum of the CO probabilities Pij(k) over the edges in compatible edge set j.
Definition 4. Force Fij(k) is defined as $F_{ij}(k) = Q(k) - \frac{\sum_{k'} Q(k')}{PS_L - PS_S + 1}$, where $PS_L$ and $PS_S$ are respectively the upper bound and the lower bound of the PS frame of the target node of edge i. This definition is the same as that in force-directed scheduling [7].
In the IIR filter example, assume that source node +g is put in PS 1. Then Pij(k) and Q(k) are computed as shown in Figure 7 .
Since we want as many compatible edges as possible to have the same weight, we select the largest CO type distribution, which is Q(1) in our example. Then we compute the forces Fij(1) only for the selected CO type distribution. The algorithm selects the target node with the largest force and retimes it to the corresponding PS. We select and retime the node with the largest force first because it contributes most to the corresponding CO type distribution. The same process is continued for the target node with the next largest force. During this process, we partially bind the selected nodes to the same FU. The number of selected target operation nodes is limited to the ceiling of the total number of target nodes divided by the total number of FUs of the corresponding type. This limit is the maximum number of operation nodes with a CO that can be executed by one FU. Setting such a limit evenly distributes the load over the FUs, resulting in improved resource utilization. In our IIR filter example, because two multipliers are available for the execution of the four target operation nodes, two nodes are selected for one multiplier. Although F*aj(1) is the lowest, we place *a first since it is on the critical path. Then we place *d since F*dj(1) is the highest. Therefore, in our example, *a and *d are selected for one multiplier. In case of a tie, the algorithm first considers a node belonging to a cycle in the DFG, because such a node is less flexible in determining the PS. After selecting the target nodes up to the limit, the algorithm repeats the same process for the remaining target nodes for another FU. It continues until the PSs of all target nodes are determined.
¹ Before applying the force-directed algorithm, we need to update the PS&TS frames of the target nodes (line 7 in Figure 5) for a given placement of source nodes.
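The selection step can be sketched as follows, reading the force as Q(k) minus the average of Q over the weights a node can take, in the manner of force-directed scheduling. The function names are illustrative. The probabilities for *a, *c, and P*dj(1) follow the values given for the example; the row for *b and the value of P*dj(2) are assumptions made so that the example is complete. The sketch also assumes that each PS in a node's frame yields a distinct edge weight, so the number of reachable weights stands in for PS_L - PS_S + 1.

```python
from collections import defaultdict
from math import ceil


def co_type_distribution(prob):
    """Q(k): sum of CO probabilities P(k) over all edges of one edge set."""
    q = defaultdict(float)
    for weights in prob.values():
        for k, p in weights.items():
            q[k] += p
    return dict(q)


def force(prob, q, node, k):
    """Q(k) minus the average of Q over the weights the node can take."""
    frame = prob[node]  # weights reachable within the node's PS frame
    return q[k] - sum(q[w] for w in frame) / len(frame)


# P(*d, 2) and the whole row for *b are assumptions for illustration.
prob = {
    "*a": {1: 1.0},          # critical, fixed
    "*b": {0: 0.5, 1: 0.5},
    "*c": {0: 0.5, 1: 0.5},
    "*d": {1: 0.5, 2: 0.5},
}

q = co_type_distribution(prob)
best_k = max(q, key=q.get)  # largest CO type distribution
forces = {n: force(prob, q, n, best_k) for n in prob if best_k in prob[n]}
per_fu_limit = ceil(len(prob) / 2)  # two multipliers are allocated

print(best_k, forces, per_fu_limit)
```

Under these assumed probabilities the largest distribution is Q(1), *a has the lowest force, and *d has the highest, which matches the ordering described for the example; the critical-path override that places *a first is not modeled here.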
[Figure 7. Panels: Calculation of CO Probabilities; Selection of Largest CO Type Distribution. Recoverable entries: P*aj(1) = 1 (critical); P*cj(0) = 1/2, P*cj(1) = 1/2; P*dj(1) = 1/2.]
As a result of force-directed retiming for the compatible edge set of the IIR filter, the PSs for the *a, *b, *c, and *d nodes are determined: PS 0 for *a and *d, and PS 1 for *b and *c. Finally, the remaining nodes that are not related to the set are put into their earliest PSs (PS_S) so as not to have a negative effect on system latency.
Operand Sharing and Feasibility Test
After PSs of all nodes are determined, operand sharing based on list scheduling is performed for the retimed DFG. Priorities of nodes are given as follows.
1. Operation nodes whose sibling operation nodes (operation nodes bound to the same FU a priori) have already been scheduled get higher priority.
2. In the case of a tie, consider urgency [9] in the retimed DAG. This takes II into account.
3. If the tie is still not broken, consider urgency in the original DAG. This takes latency into account.
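The three-level priority above maps naturally onto a tuple-valued sort key. The sketch below assumes that a larger urgency value means a more urgent node; the function names, sibling table, and urgency values are hypothetical and serve only to show the tie-breaking order.

```python
def priority_key(node, scheduled, siblings, urgency_retimed, urgency_original):
    """Sort key for the three-level priority; a larger key is picked first."""
    sibling_done = any(s in scheduled for s in siblings.get(node, ()))
    return (sibling_done, urgency_retimed[node], urgency_original[node])


def pick_next(ready, scheduled, siblings, urgency_retimed, urgency_original):
    """Choose the ready node with the highest priority."""
    return max(
        ready,
        key=lambda n: priority_key(
            n, scheduled, siblings, urgency_retimed, urgency_original
        ),
    )


# Hypothetical data: *a is already scheduled and *d is its sibling.
siblings = {"*a": ["*d"], "*d": ["*a"], "*b": ["*c"], "*c": ["*b"]}
urgency_retimed = {"*b": 3, "*c": 3, "*d": 1, "+e": 2}
urgency_original = {"*b": 1, "*c": 2, "*d": 1, "+e": 2}

first = pick_next(["*b", "*c", "*d", "+e"], {"*a"}, siblings,
                  urgency_retimed, urgency_original)
second = pick_next(["*b", "*c", "+e"], {"*a", "*d"}, siblings,
                   urgency_retimed, urgency_original)

print(first, second)
```

Here *d is chosen first because its sibling is already scheduled, even though its urgency is lowest; *b and *c then tie on urgency in the retimed graph, and the urgency in the original graph decides in favor of *c.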
After scheduling, we check the scheduled result to see whether it meets the II and PS constraints. The test result is classified as one of four cases. The list scheduling result of the IIR filter in Figure 8 belongs to Case IV and contains two excess nodes, +f and +h, which are not related to COs. By performing incremental folding and rescheduling, we obtain the final scheduling result with 3 PSs and an II of 2, shown in Figure 9, for the IIR filter example. In the solution in Figure 9, one multiplier executes the two multiplication nodes *a and *d with COs, and the other executes the *b and *c nodes with COs. Note that this is one of the solutions that our systematic pipelining algorithm generates for low power under the given PS and resource constraints.
Figure 9. The final schedule.
EXPERIMENTAL RESULTS
We implemented our pipelining method in C++ under a UNIX environment. Our implementation takes a VHDL description of a linear system and compiles it into a DFG. With the DFG, constraints on the number of pipeline stages, and a resource allocation table given by the user, the pipelining method produces a retimed, scheduled, and partially bound DFG, on which normal FU, register, and interconnect binding is performed.
For the experiment, we selected linear systems from the well-known HYPER examples [9]. To show the effectiveness of the proposed method through comparisons, we also implemented the state-of-the-art loop pipelining method proposed in [5], which minimizes the latency under constraints on target initiation interval and resources, and the operand sharing algorithm proposed by Musoll et al. [2]. We first compared our pipelining with the existing loop pipelining [5] followed by normal binding, to show that it produces a result with less power consumption at the same initiation interval. Next, we compared our pipelining algorithm with the existing pipelining algorithm followed by binding based on operand sharing, to show that the results synthesized by our power-conscious pipelining consume less power than those from a combination of such separate techniques.
To estimate power consumption, we modified SPA [8], an RT-level power estimation tool that is based on the Dual Bit Type model. To verify its reliability, we compared the estimated powers of FUs for the second-order IIR filter with those estimated by IRSIM [11], a switch-level simulator, running on the simulation file extracted from the layout. For automatic generation of the layout, we used the LagerIV layout synthesis system [10] after transforming the synthesized DFG into the .sdl file format, which is an input format of the LagerIV system. We used 64 samples of speech data for functional simulation and power estimation. Table 1 indicates that our estimator is reliable because it has only 2.4% error in evaluating the effect of the power reduction in FUs. The second-order IIR filter in Figure 2 was synthesized under two different resource constraints. The first resource constraint is given as two array multipliers and two ALUs. We assume that all the multipliers and ALUs take one cycle to execute. Columns 5 and 6 in Table 2 show the comparisons between our approach and the existing pipelining, and between our approach and the combination of existing pipelining and operand sharing, in terms of switched capacitance. In this example, the ratio of the switched capacitance in FUs to that in the overall system is about 65% to 70%. The proposed method reduces the switched capacitance in FUs by about 30% and that in the overall system by about 21%. We also present results for other benchmark examples in Table 4. The first column shows the number of operation nodes of each type and the number of compatible edges for each benchmark example. IIR7 is a 7th-order IIR filter that has a high potential for CO generation. Parallel is a parallel form of the Avenhaus filter and lattice is a lattice filter. The second column shows the initiation interval and the number of pipeline stages. The third column shows the resource constraints.
In the fourth, fifth, and sixth columns, we show the estimated switched capacitance values in FUs after applying the existing pipelining, the combination of the existing pipelining and operand sharing, and our approach, respectively. We assume that an ALU takes one cycle and a multiplier takes two cycles to execute. Equal values in columns 4 and 5 of a row mean either that no CO is generated after the existing pipelining is applied, or that normal binding following the existing pipelining is the same as binding based on operand sharing. Compared to the two techniques, the proposed method obtains switched capacitance reductions of up to 22% in FUs. As shown in column 7 of Table 4, the results of our pipelining algorithm consume less power than those of the existing one under the same initiation interval. Column 8 of Table 4 shows that our algorithm also obtains a larger power reduction than the combination of existing pipelining and operand sharing. This indicates that our CO-centric pipelining approach is effective with respect to power reduction, compared to a combination of such separate techniques. Rows in boldface in Table 4 are exceptional cases. In row 8, the combination of existing pipelining and operand sharing has a slightly smaller switched capacitance than our approach. The reason is related to constants fed to one of the inputs of the multipliers. While our approach produces four pairs of COs, the combination obtains only two pairs of COs. But it also has two pairs of constant COs, which are bound to the same multiplier and can be executed successively. As a result, both have the same number of pairs of COs. The fact that the value in column 5 is slightly smaller than that in column 6 illustrates that the effect of constant COs on power reduction is slightly larger in our experiment. The exception in row 6 has a similar cause. Note that the power consumption of modules other than FUs can increase, as shown in Table 3.
Generally speaking, pipelining and operand sharing techniques may increase the number of registers needed because variables have longer lifetimes. Since the target architecture of our hardware synthesis system is based on a point-to-point interconnection scheme, an increase in the number of registers causes an increase in the number of buses (meaning the interconnections between an FU and a register, or between registers, in our current target architecture). This is a shortcoming of our approach, because the power consumption in buses is becoming more and more dominant as technology moves into deep submicron. In a bus-based interconnection scheme, interconnects can be used more efficiently through bus allocation and scheduling. We expect that combining our pipelining with a bus-based interconnection scheme will yield better results in terms of power consumption in the overall system.
CONCLUSIONS
In this paper, we proposed a systematic pipelining method for a linear system that minimizes power and maximizes throughput, given a constraint on the number of pipeline stages and a set of resource constraints. Our method deals with loop-carried dependencies in a DFG using the proposed force-directed retiming, while considering throughput and latency. Experimental results over a set of linear systems show that our CO-centric pipelining algorithm reduces the power of functional units by 13.9% on average over the conventional algorithm under the same initiation interval, and by 9.8% on average over the combination of the conventional algorithm and operand sharing.
