Abstract-In this paper,¹ we present a power-management methodology targeted toward high-level synthesis of data-dominated behavioral descriptions. It is founded on the observation that variable assignment can significantly affect power-management opportunities in the synthesized architecture, i.e., variable assignment determines whether or not spurious operations get executed by functional units in the architecture. We introduce perfectly power-managed architectures, whose functional units do not execute any spurious operations. We present a variable assignment technique which, when used in high-level synthesis, produces architectures which are perfectly power-managed. Unlike many previously proposed power-management techniques, our method does not add latches or any other circuitry in front of functional units or registers and is, therefore, free of the attendant performance penalty. Experimental results indicate savings of up to 52.5% (average 23.0%) in power consumption over already power-optimized architectures. The area overheads due to our technique are also low and averaged 2.5% for our examples.
I. INTRODUCTION

Power dissipation in today's circuits is dominated by the dynamic component, which is incurred whenever signals in the circuit undergo a logic transition. In practice, a large fraction of the transitions incurred during the operation of typical circuits is unnecessary, i.e., it has no bearing on the final result computed by the circuit. Equivalently, not all parts of a circuit may need to function during each clock cycle, i.e., some components may be idle in some clock cycles. Recognizing this fact, several low-power design techniques have been proposed that are based on the idea of suppressing or eliminating unnecessary signal transitions. We use the term power management to refer to such techniques in general. Applying power management to a design typically involves two steps: identifying idle conditions for various parts of the circuit and redesigning the circuit in order to eliminate switching activity in idle components. Power management is frequently deployed by designers of power-constrained systems [1], and is arguably one of the most commonly used low-power design techniques. Hence, it is desirable to have power management incorporated into automatic synthesis tools as well.

¹ See the Guest Editorial of the Special Section on Low-Power Electronics and Design of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, vol. 6, pp. 518-519, Dec. 1998.

Publisher Item Identifier S 1063-8210(99)01553-X.
Many modern microprocessors have adopted the strategy of gating the clock input to registers and other circuit blocks in order to suppress unnecessary transitions in the clock signal as well as in the circuit block under consideration [2]. Automated synthesis techniques to apply clock gating and maximize its efficiency have been described in [3]. Recently, high-level synthesis techniques for power management have been proposed [4]-[6]. A scheduling algorithm, which aims to maximize the idle times for functional units, was presented in [5]. A controller respecification technique, based on redesigning the controller logic to reduce the activity in the components of the datapath, was presented in [6]. Techniques geared toward maximizing the "sleep times" of storage elements, such as registers and memories, were presented in [7]. At the logic level, two successful power-management techniques, based on guarded evaluation [8] and precomputation [9], have been presented.
Our work is geared toward enhancing power-management opportunities during high-level synthesis of data-dominated behavioral descriptions. It consists of obtaining a power-optimized register-transfer level (RTL) implementation from a data-flow graph (DFG) representation of the circuit. Data-dominated behavioral descriptions are commonly encountered in signal- and image-processing applications, and are characterized by a predominance of arithmetic operations. Past work has focused on solving the problems of allocation (deciding the numbers and types of functional units and registers available for synthesis) and assignment [binding an operation (variable) to a specific instance of a functional unit (register)] [10], functional-unit selection (selection of a functional-unit type to implement an operation) [11], and scheduling (determining the cycle-by-cycle behavior of a circuit by assigning operations to control steps) [12] in high-level synthesis for low power. DFG transformations for low power have been considered in [13]. A technique which simultaneously solves the allocation, assignment, clock selection (choosing the clock period), functional-unit selection, supply-voltage selection (choosing the supply voltage), transformation, and scheduling problems for low power has also been presented [14]. It has been observed that functional-unit power consumption dominates the overall power consumption of these circuits, and that a significant fraction of this power is consumed when the functional units do not produce useful outputs. Conventional power-management techniques, many of which involve placement of transparent latches at functional-unit inputs, can increase circuit delays along the critical path. Thus, they may not be acceptable solutions for heavily performance-constrained designs.
Consider a behavioral description where scheduling and resource sharing have been performed by a generic behavioral synthesis tool. Consider a functional unit in the RTL implementation. During the control steps in which the functional unit is utilized to perform some computation from the behavioral description, the functional unit is said to be active. During other control steps, the functional unit is said to be idle. Though a functional unit need not perform any computation in its idle control steps, the inputs to the functional unit may change values in the RTL implementation, causing unnecessary power dissipation. In this paper, we show that the manner in which register sharing is performed can significantly affect the unnecessary power dissipation in functional units during their idle cycles. A natural follow-up to the above observation is the question of whether register sharing can be constrained in any way, without incurring excessive overheads, so as to enable better power management of the functional units. We present a constrained register-sharing technique, which can be integrated into existing high-level synthesis systems, to produce architectures whose functional units do not waste power. In order to evaluate our techniques, we have incorporated them into the framework of an existing power-optimizing high-level synthesis system, which is described in [14]. Experimental results indicate up to twofold reduction in power at area overheads not greater than 8.3% without any degradation in performance, compared to architectures which are already power optimized. Our procedure complements known register-sharing techniques for low power, such as [10], which attempt to minimize switching activity at the output of registers during active clock cycles.
The paper is organized as follows. Section II establishes that register binding can have a significant effect on power management. Section III presents conditions for perfect power management. Section IV discusses the integration of these conditions into an existing high-level synthesis system. Section V presents experimental results, and Section VI the conclusions.
II. EFFECT OF REGISTER SHARING ON FUNCTIONAL-UNIT POWER CONSUMPTION
In this section, we present some preliminary concepts and motivate the key ideas presented in this paper through examples. We first describe the architectural model for the RTL circuit. We then establish that the manner in which register allocation and variable assignment are done can have a profound impact on spurious switching activity in a circuit. We then illustrate the elimination of spurious switching activity from a design through a combination of techniques, with a small overhead in the number of registers. We conclude with an example that demonstrates that spurious switching activity can sometimes be eliminated without increasing the number of registers in the synthesized architecture.
A. Architectural Model
The architectural model assumed is shown in Fig. 1, which depicts functional units, registers, and multiplexers. We use a point-to-point interconnect model, with multiplexers serving as the interconnect elements. In general, the inputs to a functional unit are routed from the outputs of registers through multiplexers, and the inputs to registers are routed from functional-unit outputs and primary inputs through multiplexers.
Traditional power-management techniques involve the placement of transparent latches at the inputs of a functional unit executing spurious operations [4], [8]. These latches are disabled when the functional unit is idle, thus suppressing spurious operations. This has two disadvantages. First, the signals that detect idle conditions might arrive late and, therefore, be unable to prevent the execution of spurious operations on the functional unit [15]. Second, insertion of transparent latches in front of functional units can lead to additional delay in a circuit's critical path. This may not be acceptable since many of today's signal- and image-processing applications need to be fast, as well as power efficient. As an alternative, we propose judicious assignment of variables to registers to ensure the absence of spurious operations.
Example 1: Consider the scheduled DFG shown in Fig. 2. Each operation in the DFG is annotated with its name (placed inside the node) and the name of the functional-unit instance it is mapped to (placed outside the node). Each variable in the DFG is annotated with its name. Clock cycle boundaries are denoted by dotted lines. The schedule has five control steps. One control step is used to hold the output values in the registers and communicate them to the environment that the design interacts with, and to load the input values into their respective registers for the next iteration. In order to assess the impact of variable assignment on power consumption, consider the two candidate assignments, Assignment 1 and Assignment 2, shown in Table I. The architectures obtained using these assignments were subjected to logic synthesis optimizations, and placed and routed using a 1.2-µm standard-cell library. The transistor-level netlists extracted from the layouts were simulated using a switch-level simulator with typical input traces to measure power. For Design 1, synthesized from Assignment 1, the power consumption was 30.71 mW, and for Design 2, synthesized from Assignment 2, the power consumption was 18.96 mW.
To explain the significant difference in power consumption of the two designs, let us analyze the switching activities in the functional units that constitute Design 1 and Design 2 using Figs. 3 and 4, respectively. In these figures, each functional unit is represented by a labeled box. The vertical lines which feed the box represent a duration equaling one iteration of execution of the DFG. Each control step is annotated with: 1) the symbolic values that appear at the inputs of the functional unit in the implementation and 2) the operation, if any, corresponding to the computation performed by the functional unit. For example, for functional unit in Fig. 3 in control step , variables and appear at its left and right inputs, respectively. The computation performed by in control step corresponds to operation *1 of the DFG. Cycles in which one or both inputs of a functionalunit change, causing power dissipation, are shaded in the figures. Each variable change can be associated with the execution of a new operation. Cycles during which spurious input transitions, which do not correspond to any DFG operations, occur are marked with an . The power consumption associated with these operations can be eliminated without altering the functionality of the design. A functional unit does not perform a spurious operation when both its inputs are unaltered. A cycle in which an input of a functional unit does not change is not marked with any variable (note that the select signals to multiplexers are configured to select functional-unit inputs, which change only when necessary, using the technique presented in [6] ). An inspection of Figs. 3 and 4 reveals that, while the functional units that constitute Design 1 execute eight spurious operations, the functional units that constitute Design 2 execute only one spurious operation. This explains the difference in power consumption between the two designs.
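The bookkeeping behind this kind of analysis can be illustrated with a minimal Python sketch. It assumes each functional unit is described by a per-control-step trace of the symbolic values at its two inputs and by the set of control steps in which a DFG operation is scheduled on it. The function name `count_spurious` and the two traces are hypothetical and are not the values of Figs. 3 and 4.

```python
def count_spurious(input_trace, active_steps):
    """Count control steps in which the inputs of a functional unit change
    although no DFG operation is scheduled on it in that step.

    input_trace  -- list of (left, right) symbolic values, one per control step
    active_steps -- set of step indices in which a DFG operation executes
    """
    spurious = 0
    for step in range(1, len(input_trace)):
        changed = input_trace[step] != input_trace[step - 1]
        if changed and step not in active_steps:
            spurious += 1
    return spurious


# Hypothetical input traces for one functional unit under two assignments.
shared_badly = [("a", "b"), ("a", "b"), ("c", "b"), ("c", "d"), ("e", "d")]
shared_well = [("a", "b"), ("a", "b"), ("a", "b"), ("c", "d"), ("c", "d")]
active = {0, 3}  # steps in which the unit executes a DFG operation

print(count_spurious(shared_badly, active))  # 2
print(count_spurious(shared_well, active))   # 0
```

In the first trace the inputs change in two idle steps, so two spurious operations are counted. In the second trace the only input change coincides with a scheduled operation.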
Example 2:
The previous example illustrated that constrained register sharing can significantly reduce the number of spurious operations in a circuit. This example illustrates a technique, called dynamic register rebinding, which, in combination with variable assignment, can completely eliminate spurious operations. Consider functional unit in Design 2. An inspection of Fig. 4 reveals that it executes a spurious operation in control step of every iteration. This is because the multiplexer at its input selects register , to which variable is assigned, from control step of each iteration to control step of the next iteration. Since acquires a new value in control step of each iteration, computes in control step , which is spurious, since the value of corresponds to the previous iteration while the value of corresponds to the current iteration. This problem would persist even if were selected at the input of , instead of . This is because is generated only at the end of control step , causing to evaluate with the old value of . Note that, in order to avoid the spurious operation, it is necessary to preserve the old value of (from the previous iteration) at the input of until the new value of in the current iteration is born. In this case, the spurious operation can be eliminated, without paying a price in terms of the number of registers used, by swapping the variables assigned to registers and in alternate iterations. In even iterations, is mapped to and to , and vice versa in odd iterations. The technique used to accomplish this is detailed in Section III-B.
In Examples 1 and 2, it was shown that, for a small increase in the number of registers, it is possible to significantly reduce or eliminate spurious operations. However, spurious operations can sometimes be eliminated at no extra register cost, as illustrated by the following example.
Example 3: Consider the scheduled DFG, shown in Fig. 5. Table II shows two candidate variable assignments, Assignment 1 and Assignment 2, both of which require the same number of registers. The architectures corresponding to the two designs were simulated with typical input traces to deter-

The above examples serve to illustrate that judicious variable assignment is a key factor in the elimination of spurious operations. Our work aims to eliminate spurious operations in functional units. The architectures produced by applying the techniques presented in this paper are referred to as perfectly power managed since the functional units in these architectures do not execute any spurious operations.
III. PERFECT POWER MANAGEMENT
In this section, we derive conditions whose satisfaction ensures the existence of the perfect power-management property. Consider a functional unit , which performs an operation of type , executing operations , , , , and , which are scheduled in time steps , , and , respectively. Operation is fed by variables and , at its left- and right-hand-side inputs, respectively, and generates variable . consumes power whenever any of its inputs changes. The total power consumed can be written as (1) where is the variable at the left (right-hand side) input of before the change and is the new variable, and is the power consumption associated with the variable change. This is because several operations may be mapped to a single functional unit. Note the use of primed variables to denote functional-unit inputs and the use of unprimed variables to denote operands. This notation is adopted because the variables that actually appear at a functional unit's inputs are a superset of those that feed the operations performed by it, and need to be denoted differently. For example, if the DFG shown in Fig. 2

Example 4: Fig. 6 shows a functional unit in an imperfectly power-managed architecture on the left-hand side and a functional unit in a perfectly power-managed architecture on the right-hand side. The lines at the left- and right-hand-side inputs of the functional unit denote the time axis and are annotated with input variables and their birth-control steps (note that , , etc., do not necessarily represent consecutive control steps). The shaded bars correspond to changes in input values and represent control steps in which power is consumed. As shown in the figure, in the perfectly power-managed architecture, all power-consuming input changes correspond to operations required by the DFG.
A. Conditions for Perfect Power Management
The following three conditions are sufficient to guarantee perfect power management.
1) The sequence of values appearing at the left-hand-side input of in any iteration corresponds to , , , . 2) . In this case, in order to avoid the execution of spurious operations, we need to ensure that the new value of succeeds at the left input of . Therefore, must be selected at the left input of until control step . However, since acquires a new value in control step , the functional unit consumes unnecessary switching power in control step . Therefore, for perfect power management, the new value of needs to be stored in a separate register, leaving the old value intact, until it is no longer needed. Thus, sometimes different instances of the same variable, generated in different iterations of execution of the DFG, need to be mapped to different registers. This can be easily done at the expense of one extra register by buffering the new value of in a spare register before allowing it to feed . However, we next present a solution to this problem, called dynamic register rebinding, which avoids this register overhead.
Let and denote the registers to which variables and are mapped. If is free until the birth cycle of , in alternate iterations of the DFG, the variable assignments for registers and are swapped, i.e., if denotes the set of variables mapped to and denotes the set of variables mapped to in an even iteration, the variables in are mapped to in an odd iteration. Under this scenario, the old value of is preserved in until is born, which guarantees the absence of unnecessary switching activity in . Note, however, that this solution could lead to extra interconnect in the datapath because every functional unit initially feeding should now also feed , and every functional unit initially fed by should now also be fed by . In practice, for interconnect overheads (measured as number of 2-to-1 multiplexers added) up to a user-specified threshold, this method is adopted. For interconnect overheads exceeding this threshold, register duplication is preferred. Also, note that performing dynamic rebinding is conditional on being free until the birth cycle of . Hence, in general, it might not be possible to utilize this technique. In this case, we resort to buffering the new value of in a spare register until it needs to feed .
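The alternating assignment and the choice between rebinding and register duplication described above can be sketched in Python as follows. This is only an illustration: the function names, the register names `R1` and `R2`, and the variable names are hypothetical, and the mux-count threshold stands in for the user-specified interconnect limit mentioned in the text.

```python
def rebind(iteration, vars_a, vars_b, reg_a="R1", reg_b="R2"):
    """Return a variable -> register map for one DFG iteration.

    In even iterations vars_a live in reg_a and vars_b in reg_b;
    in odd iterations the two assignments are swapped.
    """
    if iteration % 2 == 0:
        first, second = reg_a, reg_b
    else:
        first, second = reg_b, reg_a
    binding = {v: first for v in vars_a}
    binding.update({v: second for v in vars_b})
    return binding


def choose_fix(partner_free, extra_muxes, mux_threshold):
    """Pick rebinding only if the partner register is free for long enough
    and the added 2-to-1 multiplexers stay within the threshold."""
    if partner_free and extra_muxes <= mux_threshold:
        return "rebind"
    return "duplicate"


print(rebind(0, ["x"], ["y"]))  # {'x': 'R1', 'y': 'R2'}
print(rebind(1, ["x"], ["y"]))  # {'x': 'R2', 'y': 'R1'}
print(choose_fix(partner_free=True, extra_muxes=6, mux_threshold=4))  # duplicate
```

Because the new instance of a variable is written to the other register in the next iteration, the old instance stays readable without an extra register, at the cost of the additional multiplexer inputs that `choose_fix` weighs.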
IV. THE ALGORITHM
The techniques of constrained register sharing, dynamic register rebinding, and register duplication presented in the earlier sections can be incorporated into any high-level synthesis system. In this section, we present our algorithm for perfect power management in the context of a power-optimizing high-level synthesis system [14]. We begin with an overview of this system. This is followed by a description of the modifications necessary to produce perfectly power-managed architectures.

Fig. 8 illustrates the basic steps of the high-level synthesis system, which is based on an iterative improvement technique that can escape local minima. The system accepts as input a behavioral description compiled into a DFG, a library of modules precharacterized for area, delay, and power, a set of typical input traces to facilitate power estimation, and the sampling period (the time interval between the arrival of two successive iterations of inputs). Synthesis starts with the identification of a set of feasible clock periods and supply voltages. The clock periods and supply voltages that need to be considered can usually be pruned down to a small number. For each such pair, synthesis is initiated with the derivation of an initial solution, complete with scheduling, allocation, and binding information, which is transformed by the application of a sequence of moves (individual moves are allowed to have a negative gain). Moves can replace an instance of a functional unit in the datapath by one of a different type, merge pairs of functional units or registers into one, or split a functional unit or register into multiple functional units or registers. The power impact of each of these moves is evaluated and the most power-efficient sequence of moves that does not violate sampling period constraints is implemented. The boxes shown shaded in Fig. 8 indicate the synthesis steps which need to be modified to incorporate perfect power management.
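One way to picture how a sequence with individual negative-gain moves can still be accepted is to keep the best feasible prefix of the sequence. The Python sketch below assumes the per-move power gains and the feasibility of the design after each move are already known. The function name `best_prefix` and the numbers are hypothetical, and the sketch is not taken from the system in [14].

```python
def best_prefix(gains, feasible=None):
    """Return (number_of_moves, total_gain) for the best prefix of a move
    sequence, considering only prefixes that end in a feasible design.

    gains    -- power gain of each move in order; may be negative
    feasible -- optional flags, True if the design after that move meets
                the sampling period constraint
    """
    if feasible is None:
        feasible = [True] * len(gains)
    best_len, best_total, running = 0, 0.0, 0.0
    for count, (gain, ok) in enumerate(zip(gains, feasible), start=1):
        running += gain
        if ok and running > best_total:
            best_len, best_total = count, running
    return best_len, best_total


# The first move loses power on its own but enables a larger gain.
print(best_prefix([-1.0, 3.0, -0.5, 0.25]))              # (2, 2.0)
# A prefix that violates the timing constraint is skipped.
print(best_prefix([-1.0, 3.0, 1.0], [True, False, True]))  # (3, 3.0)
```

Accepting the first move only because the sequence as a whole pays off is what lets such a search leave a local minimum.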
Our power-management technique is applied at three stages. The first two stages belong to move-selection procedures, and the third occurs after iterative improvement. The first two stages ensure perfect power management within each DFG iteration, but ignore inter-iteration effects described in Section III-B. The third stage, which occurs after synthesis, is intended for the correction of such inter-iteration problems.
A complete schedule and datapath exist at every step of the algorithm, i.e., before and after the application of each move. When a functional-unit selection, resource sharing, or resource splitting move is considered, the resulting architecture is first scheduled to ensure that a valid schedule exists. After an architecture is scheduled, variable lifetimes are checked to see if their intra-iteration extensions, required to ensure the absence of spurious operations, cause any conflicts in variable lifetimes. The initial solution is characterized by the absence of spurious operations since each variable is mapped to a separate register. Architectures which display spurious operations are rejected. Typically, spurious operations are introduced only by moves which attempt to merge different registers into one. Hence, this restriction does not, in practice, affect functional-unit sharing, splitting, or selection moves. Note that we do not check at this stage whether an input variable of the last operation (by ascending order of birth) mapped to a functional unit can be kept alive until all inputs to the first operation mapped to the functional unit are born in the next iteration. This is because, even if this problem is detected and solved at this stage, extra registers added or inter-iteration assignment swaps performed to fix the problem during one move of iterative improvement might prove inadequate or redundant because of future moves. A detailed description of the rationale behind different synthesis steps is omitted for brevity, but can be found in [14]. It has been observed that functional-unit selection, scheduling, and resource sharing affect each other, and the total power consumed by the design. Therefore, our synthesis system tries to interleave these interdependent tasks and exploit the benefits that arise from their interaction.
The pseudocode for the checking procedure is shown in Fig. 9. The outer loop represents the traversal of the functional units that constitute the datapath. The operations mapped to the functional unit are then traversed in increasing order of birth. Each input variable of the current operation is then checked to ensure that its death time can be extended up to the birth time of the latest arriving input of the operation that immediately follows it on the same functional unit. If so, the multiplexers at the input of the functional unit under consideration select that variable to appear at the appropriate input until the latest arriving input of the following operation is born. If not, the architecture is not deemed perfectly power manageable and a different one is considered. Since the initial architecture is guaranteed to be free of spurious operations of the kind that this subroutine checks for, our synthesis algorithm always yields a valid solution.
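The check described above can be approximated by the following Python sketch. It is not the pseudocode of Fig. 9. It assumes that a lifetime extension fails exactly when another variable bound to the same register is born strictly between the birth of the held variable and the birth of the latest arriving input of the next operation. The `Var` record, the function name, and the example data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Var:
    name: str
    birth: int      # control step in which the variable is written
    register: str   # register the variable is bound to


def find_violations(fu_ops, variables):
    """Return (functional_unit, held_variable, overwriting_variable) triples.

    fu_ops    -- dict: functional unit -> list of operations in increasing
                 order of birth, each given as a tuple of input variable names
    variables -- dict: variable name -> Var
    """
    by_register = {}
    for var in variables.values():
        by_register.setdefault(var.register, []).append(var)

    violations = []
    for fu, ops in fu_ops.items():
        for current, following in zip(ops, ops[1:]):
            latest = max(variables[n].birth for n in following)
            for name in current:
                held = variables[name]
                for other in by_register[held.register]:
                    overwrites = held.birth < other.birth < latest
                    if other.name != name and overwrites:
                        violations.append((fu, name, other.name))
    return violations


variables = {
    "a": Var("a", 0, "R1"), "b": Var("b", 0, "R2"),
    "c": Var("c", 1, "R1"),  # shares R1 with a and is born too early
    "d": Var("d", 3, "R3"), "e": Var("e", 3, "R4"),
}
fu_ops = {"add1": [("a", "b"), ("d", "e")]}
print(find_violations(fu_ops, variables))  # [('add1', 'a', 'c')]
```

An empty result corresponds to an architecture that passes this intra-iteration check. In the example, moving `c` to a register of its own removes the violation.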
Problems induced by inter-iteration dependencies are solved once iterative improvement terminates. The pseudocode for the procedure to handle inter-iteration dependencies is shown in Fig. 10. It corresponds to the shaded box in Fig. 8 that represents the penultimate step of the synthesis algorithm. As before, the outer loop represents traversal of functional units in the datapath. In this case, we are interested in the first- and last-born operations mapped to the functional unit. If an input variable to the last-born operation has a birth time less than the birth time of the latest arriving input to the first-born operation, the architecture is not perfectly power-managed. This problem is handled either by register duplication or by an inter-iteration variable assignment swap.
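The birth-time comparison stated above can be written down directly. The Python sketch below is not the procedure of Fig. 10. It only flags the offending input variables of one functional unit, using hypothetical names and birth times, and leaves the repair (duplication or swap) to the caller.

```python
def inter_iteration_conflicts(ops, birth):
    """Return the inputs of the last-born operation that are reborn, in the
    next iteration, before all inputs of the first-born operation are ready.

    ops   -- operations mapped to one functional unit, in increasing order
             of birth, each given as a tuple of input variable names
    birth -- dict: variable name -> control step in which it is born
    """
    first, last = ops[0], ops[-1]
    latest_first = max(birth[name] for name in first)
    return [name for name in last if birth[name] < latest_first]


birth = {"a": 0, "b": 2, "c": 1, "d": 3}
print(inter_iteration_conflicts([("a", "b"), ("c", "d")], birth))  # ['c']
# A unit that executes a single operation can conflict with itself.
print(inter_iteration_conflicts([("a", "b")], birth))              # ['a']
```

Each flagged variable is a candidate for register duplication or for an inter-iteration assignment swap.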
V. EXPERIMENTAL RESULTS
We have implemented the ideas presented in the previous sections within the framework of the power-optimizing high-level synthesis system SCALP [14]. We have performed experiments to evaluate our techniques using several behavioral descriptions of signal- and image-processing applications. The modified SCALP reads in a textual description of a DFG, and performs supply-voltage selection, clock selection, functional-unit selection, scheduling, resource allocation, and binding, finally resulting in a power-optimized, perfectly power-managed architecture, which consists of a datapath netlist and a finite-state machine description of the controller. The controller and datapath netlists are merged and mapped to the MSU standard cell library using the SIS logic synthesis system. Placement and routing are performed using tools from the OCTTOOLS suite. A switch-level circuit extracted from the layout is simulated using the switch-level simulator IRSIM-CAP and the capacitance switched is recorded and used to estimate the power consumption. Since we estimate the power consumption from a layout-based simulation, it is possible to accurately estimate controller power, glitching power, interconnect power, clock power, etc., which cannot be easily accounted for at the higher levels. The input sequences used for simulation are obtained by first generating a zero-mean Gaussian sequence and then passing the result through an autoregressive filter to introduce the desired level of temporal correlation.

Table III shows the experimental results obtained. The first column gives the name of the circuit. Of our benchmarks, Paulin, Tseng, Dot_prod, Bandpass, and Dct_lee are well-known high-level synthesis benchmarks. Lat and Fir are a part of the HYPER [13] distribution. Test1 is the DFG shown in Example 3.
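The input-generation step can be illustrated with a short Python sketch. The text does not give the order or coefficients of the autoregressive filter, so the first-order filter, the coefficient `rho`, and the function names below are assumptions made for illustration only.

```python
import random


def correlated_sequence(n, rho, sigma=1.0, seed=0):
    """Zero-mean Gaussian noise passed through a first-order autoregressive
    filter: x[k] = rho * x[k-1] + e[k], with e[k] drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    samples, prev = [], 0.0
    for _ in range(n):
        prev = rho * prev + rng.gauss(0.0, sigma)
        samples.append(prev)
    return samples


def lag1_autocorrelation(samples):
    """Sample correlation between consecutive elements of the sequence."""
    mean = sum(samples) / len(samples)
    num = sum((a - mean) * (b - mean) for a, b in zip(samples, samples[1:]))
    den = sum((a - mean) ** 2 for a in samples)
    return num / den


smooth = correlated_sequence(5000, rho=0.9)
white = correlated_sequence(5000, rho=0.0)
print(lag1_autocorrelation(smooth) > lag1_autocorrelation(white))  # True
```

A larger `rho` yields slowly varying inputs with less switching between consecutive samples, while `rho = 0` leaves the white Gaussian sequence unchanged.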
Major columns P and A show the power consumption and area of the circuits synthesized without and with perfect power management. Note that, in either case, the control signals to the multiplexers and registers in the circuit were specified with the aim of optimizing the power consumption of the design, using the techniques presented in [6]. Minor columns SCALP and PPM refer, respectively, to architectures optimized by unmodified SCALP (without the aid of power management) and by techniques presented in this paper. Columns P.S. and A.O. represent the power savings and area overheads for PPM designs with respect to SCALP designs. Major column CPU time shows the central processing unit (CPU) times required to synthesize the normal and power-managed designs. These results were obtained on an SGI Challenge with 256 Mbyte random access memory (RAM) running at 150 MHz. The CPU time results obtained indicate that the synthesis times for designs produced with and without the benefits of our technique did not differ significantly. The sampling time constraint was fixed by performing as-soon-as-possible (ASAP) scheduling after mapping each operation to the fastest possible functional unit, assuming each operation to be mapped to a separate functional unit. The cycle time of the synthesized designs was fixed at 25 ns and the sampling periods of the SCALP and PPM designs were constrained to be equal. The results obtained indicate that, for the same delay, our methods achieve a 23.0% average power reduction over already power-optimized circuits at an average area overhead of only 2.5%. Note that unmodified SCALP obtains up to about an order of magnitude reduction in power consumption over area-optimized circuits operating at 5 V. The power savings reported here are obtained on top of those savings.

In some cases, the area overhead is negative. This occurs when there is no overhead in the number of registers and the interconnect structure of the PPM design is more regular than that of the SCALP design.
To assess the impact of clock cycle time upon the obtained results, we synthesized the example Bandpass under two other clock cycle constraints: 12.5 and 50 ns, with and without perfect power management, subject to the same sampling period constraint. The power savings obtained were 8.8% and 8.7%, and the area overheads were 0.1% and 0.2%, respectively. Thus, power savings and area overheads were not significantly impacted by a change in clock cycle time.
To assess the impact of sampling period, we synthesized the Bandpass example under throughput constraints of 59, 72, and 86 cycles, in addition to the original throughput constraint of 32 cycles. The power savings obtained were 8.7%, 34.7%, and 30.9%, and the area overheads were 0.3%, 0%, and 13.8%, respectively. Thus, in general, power savings and area overheads increased with an increase in sampling periods. This is because increased sampling periods facilitate increased register sharing, which leads to an increase in the execution of spurious operations. In order to eliminate spurious operations, we need to inhibit register sharing to a greater extent than for tightly performance-constrained designs.
VI. CONCLUSIONS
In this paper, we introduced the concept of perfect power management, which is geared toward eliminating spurious switching activity in the functional units that constitute a data-dominated circuit. We demonstrated that the manner in which variables are assigned to registers can have a profound impact on the existence of spurious operations. We derived conditions on variable assignment whose satisfaction guarantees the absence of spurious operations in the functional units that constitute a datapath. We developed an algorithm, based on these conditions, which can be incorporated into an existing high-level synthesis system, to eliminate spurious switching activity in functional units. We implemented our ideas within the framework of an existing high-level synthesis system, geared toward power optimization, and demonstrated up to twofold reduction in power (mean 23.0%, standard deviation 13.3%) at an average area overhead of 2.5% without any degradation in the performance of the synthesized architecture.
Sujit Dey
(S'90-M'91) received the Ph.D. degree in computer
Introduction
High-level synthesis for low power takes as its input a behavioral description in the form of a data-flow graph (DFG) or control-data flow graph (CDFG) and outputs a power-optimized RTL circuit [1]-[5]. Power management has been recognized as a very useful technique for reducing power consumption [1], [6]-[11]. It is well known that register binding may introduce spurious switching activity in the functional units that the registers feed [8], [9]. One way to eliminate such spurious switching activity is to add transparent latches at the functional unit input ports [8]. However, the power consumed by the latches themselves reduces the power savings. Also, these latches result in delay overheads that need to be taken into account.
Another approach for reducing spurious switching activity is to reconfigure the multiplexer networks and bind variables to registers in such a way that when a functional unit is going to be idle in the next control step, it takes its inputs from the registers it most recently used and these registers do not load a new value in this step. In [9], a technique is proposed to redesign the control logic to configure existing multiplexer networks to minimize (though not necessarily eliminate) spurious switching activity in the data path. On the other hand, a register binding method which guarantees a perfectly power managed (PPM) RTL circuit for data-flow intensive (DFI) behaviors, if the multiplexer network is properly designed, was presented in [10]. This method was extended in [11] to control-flow intensive (CFI) behaviors. However, it could not ensure that a given set of functional units is PPM, i.e., it did not provide a sufficient condition for CFI behaviors.
We present a sufficient condition to ensure that any given set of functional units in an RTL implementation of both DFI and CFI behaviors is PPM. No such condition has been presented before for CFI behaviors. Based on the sufficient condition, we propose a register binding algorithm for PM circuits.
Acknowledgments: This work was supported in part by Alternative System Concepts under an SBIR contract from Army CECOM and in part by DARPA under contract no. DAAB07-00-C-L516.
Our method does not impact the choice of the scheduling or functional unit binding algorithms and is easy to incorporate into existing high-level synthesis systems. Our experimental results show an average power reduction of 45.9% at an average area overhead of 7.7% compared to power-optimized RTL circuits.
The paper is organized as follows. In Section 2, we give an example to motivate our method. In Section 3, we present a sufficient condition for PPM register binding which leads to a register binding algorithm for power management, and discuss selective PM circuits. In Section 4, we present an experimental platform and one way of incorporating power management checks into existing high-level synthesis tools. We present experimental results in Section 5 and conclude in Section 6.
Motivational Example
Suppose we have two adders and one comparator. <1 and <2 are bound to the comparator, +1 and +5 to one of the adders, say adder1, and +2, +3 and +4 to the other one, say adder2. Suppose we bind variables c and x to the same register, and let d have its own, as shown in Figure 2.2. In state A, the two input values for adder2 correspond to c and d, respectively. Since variable x is defined in state A, the register to which c is bound feeds the value of variable x in state B, in which adder2 is supposed to be idle. Assume that in state B the select signals for multiplexers MUX2 and MUX3 are all 0. Then the input values for adder2 correspond to x and d, respectively. This causes spurious switching activity in adder2 in the transition from state A to state B.
One way to eliminate the above spurious switching activity is not to bind variables c and x to the same register. Instead, we can bind variables c and f to the same register. We can set the select signals of both MUX2 and MUX3 to 0 in state B, such that registers Reg1 and Reg3 are still selected to feed input ports in1 and in2 of adder2, respectively, as in state A. Moreover, since variable f is not defined in state A, register Reg1 still holds the same value in state B as in state A. Similarly, register Reg3 still holds the same value since no other variable is bound to it. Therefore, no spurious switching activity occurs in adder2 in state B.
It is worth pointing out that in real designs, spurious switching activity could be much more serious than in this example if it occurs inside a loop.
PM Register Binding
In this section, we first present some rules to guide register binding in order to reduce the spurious switching activity in a given set of functional units in the data path. We then discuss the concept and implementation of retentive multiplexers which are used to realize PM circuits. Retentive multiplexers are multiplexers which can preserve their last select signal values in the control steps in which the select signals are don't-care. We then address the concept of selective PM circuits.
Register Binding for PM circuit
As mentioned before, the schedule of a behavior can be represented in the form of an STG. For each variable v, defstates(v) is defined as the set of states in which v is defined, and usestates(v) is defined as the set of states in which v is used. A variable v is live in state p if there is a directed path in the STG from state p to another state that uses v, without going through any state in defstates(v) except state p itself. livestates(v) is defined as the set of states in which v is live. livestates(v) can be generated using liveness analysis algorithms from compiler theory [13].
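The liveness definition above can be sketched as a small fixed-point computation over the STG. The Python sketch below is illustrative only: the dictionary encoding of the STG, the function name `livestates`, and the four-state example are assumptions made here and are not taken from the synthesis tool used in the paper.

```python
def livestates(stg, defstates, usestates):
    """Return the set of states in which a variable is live.

    stg       -- dict mapping each state to a list of its successor states
    defstates -- set of states in which the variable is defined
    usestates -- set of states in which the variable is used

    A state p is live if some successor s either uses the variable, or is
    itself live and does not redefine the variable (so the path from p to
    the use does not pass through a defining state other than p).
    """
    live = set()
    changed = True
    while changed:                      # iterate until a fixed point is reached
        changed = False
        for p, successors in stg.items():
            if p in live:
                continue
            for s in successors:
                if s in usestates or (s in live and s not in defstates):
                    live.add(p)
                    changed = True
                    break
    return live


# Hypothetical four-state loop A -> B -> C -> D -> A.
stg = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}

# A variable defined in state A and used in state C.
print(sorted(livestates(stg, defstates={"A"}, usestates={"C"})))  # ['A', 'B']
```

In this example the variable is live in A (where it is produced) and in B (where it must be held), but not in C or D, so a register holding it may be overwritten at the end of state C.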
In the example in Section 2, we have

Two variables a and b can share a register if they do not have overlapping lifetimes during circuit execution. This can be determined based on the defstates and livestates of variables a and b as follows [13].
Theorem 1: Two variables a and b do not have overlapping lifetimes during circuit execution if livestates(a) ∩ defstates(b) = ∅ and livestates(b) ∩ defstates(a) = ∅.
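As a rough illustration, the test in Theorem 1 reduces to two set intersections. In the Python sketch below, the function name `can_share`, the state names S0 to S3, and the three variables are hypothetical and chosen only to exercise both outcomes.

```python
def can_share(live_a, def_a, live_b, def_b):
    """Theorem 1 check: True if variables a and b may share a register.

    Each argument is a set of STG states: livestates and defstates of a,
    then livestates and defstates of b.
    """
    return not (live_a & def_b) and not (live_b & def_a)


# Hypothetical variables, given as (livestates, defstates).
a = ({"S0", "S1"}, {"S0"})
b = ({"S2", "S3"}, {"S2"})
c = ({"S1", "S2"}, {"S1"})

print(can_share(*a, *b))  # True: neither variable is defined while the other is live
print(can_share(*a, *c))  # False: c is defined in S1, where a is still live
```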
We next introduce the notion of extended set of live states. For a variable v, its extended set of live states with respect to the functional unit F that it feeds is defined and computed recursively by Algorithm 1 in Figure 3 (2) 
□
In Theorem 2, condition (1) can be satisfied through proper register binding, while condition (2) is determined by the schedule and can be satisfied through variable re-naming and register reassignment. If condition (2) is not satisfied, we cannot guarantee the PPM property of the given set of functional units. However, satisfaction of condition (1) only will still significantly reduce the spurious switching activity in the circuit, as shown in Section 5. We call this PM register binding.
Retentive Multiplexers
The essence of a PM functional unit is to make sure that the input values to it remain the same as they were in the last state in which it was active. To achieve this, first, the select signals of the multiplexers feeding its input ports should retain the same values, and second, the register value which feeds the selected input of the multiplexer should remain the same. The latter can be ensured by using the proposed sufficient condition (Theorem 2). However, the former needs help from the controller since retentive multiplexers are assumed. We next provide two different implementations of retentive multiplexers.
The first implementation is based on the controller respecification method given in [9] . The don't-cares of the multiplexer select signals in the state transition table of the controller are identified and assigned proper values. The spurious switching activity can be statistically reduced if not eliminated. Such a multiplexer is called static retentive since the don't-cares are assigned values statically.
In the second implementation, we introduce an extra control signal to indicate whether the concerned functional unit is idle or not. A one-bit delay latch [12] is then added to the select signal input of the multiplexers in question. When the functional unit is idle, the latch is disabled by the added control signal and holds the previous value. Such multiplexers are called dynamic retentive and are the ones assumed in Theorem 2. The disadvantage of this implementation is the overhead introduced in the controller (one extra bit per functional unit) and multiplexers (one-bit latch for some 2-to-1 multi-bit multiplexers). However, compared with placing transparent latches before a functional unit input port [8] , the extra hardware for our approach is very small.
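The behavior of a dynamic retentive multiplexer can be sketched at the cycle level as follows. This is a behavioral model only, written under the assumption that the added control signal is available as one idle flag per control step; the function name `retentive_selects` and the example trace are hypothetical.

```python
def retentive_selects(selects, idle):
    """Return the select values actually seen by the multiplexer.

    selects -- select value requested by the controller in each control step
               (its value is a don't-care while the functional unit is idle)
    idle    -- one flag per control step; True when the functional unit is idle

    While the unit is idle the latch is disabled and keeps the last value
    that was applied in an active step.
    """
    held = None          # nothing latched before the first active step
    seen = []
    for sel, is_idle in zip(selects, idle):
        if not is_idle:
            held = sel   # latch enabled: pass the controller's select through
        seen.append(held)
    return seen


# The controller drives arbitrary values (here 0) during the two idle steps.
print(retentive_selects([1, 0, 0, 1], [False, True, True, False]))  # [1, 1, 1, 1]
```

With the select held at 1 through the idle steps, the multiplexer keeps routing the same register to the functional unit, so its input can only change if that register is overwritten.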
In practice, we have found that using static retentive multiplexers eliminates most of the spurious switching activity in a functional unit. The fact that they do not require much overhead makes them doubly attractive.
Selective PM Circuits
PM register binding tries to reduce spurious switching activity in functional units with the help of some extra registers needed for this purpose. The power/area of some register implementations can be comparable to those of some functional units. The ideal way for balancing register power consumption overhead and spurious switching activity in functional units is to compare the two for each register binding. This method requires spurious switching activity estimation for every affected functional unit for every register binding choice.
A simpler way is to only make power-hungry functional units PM instead of all functional units. That is, only binding for registers that feed power-hungry functional units is checked using Theorem 2. For registers feeding other functional units, binding is done without regard to whether it will cause spurious switching activity or not. In the RTL library we used, the following functional units were more power-hungry than registers: different types of multipliers, dividers and general-purpose ALUs. They were the functional units that were targeted in our experiments.
Experimental Platform
To evaluate the proposed method, we incorporated it into an existing low power high-level synthesis tool which can handle both DFI and CFI behaviors [5] . To ensure that for different register binding methods, the same functional unit optimization is done, we performed all functional binding first and then register binding, instead of interleaving them.
After reading the specification in the form of a CDFG and resource or timing constraints, the tool starts with either time-constrained scheduling or resource-constrained scheduling. The resulting STG is simulated to collect data to evaluate the power and timing of the RTL architectures generated later. Data path optimization is based on variable-depth iterative improvement, which is capable of escaping local minima. It starts with a fully parallel architecture in which each operation is bound to its own fastest possible functional unit from the RTL design library and each variable is bound to its own register. This architecture is iteratively improved for power optimization purposes through various moves such as functional unit selection, resource sharing/splitting, multiplexer tree selection, etc., first for functional units and then for registers. If a PM or a selective PM circuit is desired, condition (1) in Theorem 2 is checked before two variables are allowed to share a register. The power-optimized RTL data path output at the end is the best architecture seen during iterative improvement. Then controller respecification is done as described in Section 3.2. That is, static retentive multiplexers are used for different register bindings, and don't-cares in the controller state transition table are appropriately specified.
Our register binding technique can be incorporated in other high-level synthesis tools as well. Once scheduling and functional unit binding information is available, register binding can simply be based on the sufficient condition given in Theorem 2.
Experimental Results
In this section, three register binding methods are compared. The first (maximal) allows sharing of registers as much as possible without regard to spurious switching activity. The second is register binding to obtain a PM circuit using Theorem 2. The third is register binding to obtain a selective PM circuit geared towards only power-hungry functional units in the data path. The controller is re-specified as described in Section 3.2 in all the cases. We show results for both static and dynamic retentive multiplexers.
We used various high-level synthesis benchmarks to establish the efficacy of our technique. chemical is an IIR filter used in the industry. dct_dif, dct_lee and dct_wang are different algorithms for computing the Discrete Cosine Transform [14]. diffeq is a differential equation solver from the NCSU CBL high-level synthesis benchmark repository [15]. wavelet performs the Discrete Wavelet Transform. These are all DFI behaviors. We also constructed a behavior, called con_loop, of two concurrent loops which are both control and data intensive. The VHDL code for it can be downloaded from [16]. The power consumption and area of the RTL circuits are computed using an NEC 0.35µm RTL library.
In Table 5.1, we show the percentage reduction in total power as well as the overhead in area for PM circuits (i.e., all functional units are PM) compared to the maximal register binding case. On average, the PM circuits reduce total power by 45.9% at an area overhead of 7.7%, compared to circuits which are power-optimized with spurious switching activity ignored.
The total power for the four cases, maximal, PM with static retentive multiplexers (PM), selective PM, and PM with dynamic retentive multiplexers, is shown in Figure 5.1. The selective PM method is seen to be as good as the PM method in power reduction but requires less area overhead. Dynamic retentive multiplexers do help in two cases, but not in others.
We found that the average spurious switching power as a fraction of total power for maximal binding was 52.7%, whereas for the PM and selective PM circuits, the average was reduced to only 11.1% and 11.7%, respectively. Use of dynamic retentive multiplexers reduces spurious switching power to zero. The CPU times for high-level synthesis ranged from 9 seconds to 145 seconds on a Pentium-III machine with 256 MB memory running under Linux. 
Conclusions
We demonstrated that by properly binding variables to registers in high-level synthesis, spurious switching activity in functional units can be significantly reduced. We gave a general sufficient condition for a given set of functional units to be free of spurious switching activity in implementations of both DFI and CFI behaviors. Based on this condition, we proposed a register binding algorithm to reduce spurious switching activity in a given set of functional units. We achieved an average 45.9% power reduction at an average 7.7% area overhead compared to already power-optimized architectures. This condition can be applied selectively to power-hungry functional units in the data path.
We also discussed how the controller and/or multiplexers can be redesigned to cooperate with register binding to reduce spurious switching activity in the data path. Dynamic retentive multiplexers can eliminate most spurious switching activity but introduce some extra hardware, which is very small compared to that of placing transparent latches before functional units' input ports. Static retentive multiplexers hardly require any extra hardware, but permit some spurious switching activity. In practice, we found that PM register binding with static retentive multiplexers can eliminate most of the spurious switching activity in the benchmarks.
I. INTRODUCTION
High-level synthesis (HLS) has been intensively researched to improve the productivity of VLSI design, and many high-level synthesis systems have been introduced. HLS tools have been shown to be an especially good choice for specific application domains such as digital signal processing (DSP) circuits or control-dominated circuits. These application areas usually require low power consumption because of their mobile nature. There have been many approaches in high-level synthesis to reduce the power consumption of the generated circuit. In [5], [6], functional unit allocation and storage unit allocation algorithms that minimize the switching activity of the nets are introduced. In [7], various low-power techniques are applied iteratively based on the switched capacitance calculation. In [3], the inputs of functional units are suitably controlled by changing the storage allocation so that functional units execute only necessary operations. All these schemes assume flip-flops as storage elements for several reasons. First, flip-flops are easy to reason about and to apply algorithms to, since they can be read and written at the same time. Second, the use of latches usually requires a two-phase clocking scheme, which designers are not fond of.
However, the power consumption of a latch is about one third smaller than that of a flip-flop, and its area is also about one third smaller [8]. In addition, the smaller clock capacitance of a latch can further reduce power consumption in the clock tree. Since much of the power consumption in recent chips is due to the clock tree, using latches in high-level synthesis of the data path is very beneficial from a power point of view. There have been some approaches that use latches instead of flip-flops to obtain these benefits. One example in [1] showed up to 50% power saving by using latches, but it relies on a multiple-clocking scheme, which is an unattractive choice for chip designers. In [2], storage elements implemented with flip-flops are replaced with latches if the input/output behavior of the circuit is not affected. It showed good results for control-dominated circuits, but in data-dominated circuits not many of the storage elements meet the conditions for substitution, and the change in the waveform of internal signals may increase functional unit power consumption.
With these in mind, we propose a storage allocation method that considers latch-based storage element implementation with a single clocking scheme. This paper is organized as follows. Section II describes previous approaches in more detail, and Section III gives a motivating example showing the possibility of power reduction using latches. Section IV describes the binding algorithm used in our approach. Experimental results are shown and analyzed in Section V, and conclusions are drawn in Section VI.
II. PREVIOUS WORKS
In [1], circuits are partitioned into n sub-networks so that the storage elements of each partition are activated at n distinct time steps, and each partition is clocked by one of n non-overlapping clocks of frequency f/n, where f is the operating frequency of the circuit before partitioning. With this architecture, area can increase to some degree since registers and functional units cannot be shared by different partitions, but the power consumption can be reduced since the sum of the power consumption of the partitions is usually smaller than that of the original circuit because of the reduced load capacitance.
Another rationale for the power reduction in that architecture is the reduced operating frequency. A reduced frequency means fewer signal transitions. But this is true only when no power management scheme is applied to the original circuit, i.e., when all storage elements are clocked in every clock cycle. If a power management scheme is applied to the original circuit, by using gated clocks or input multiplexers, the storage elements change their values only when new values have to be stored. The minimum required number of value changes is determined not by the architecture but by the algorithm, and a simple power management scheme guarantees that minimum number of transitions in the storage elements. So the number of transitions in the storage elements of the partitioned circuit is not smaller than that of the original circuit.
In spite of this, the simulation results show that the partitioned circuits consume less power than the original circuits with a conventional power management scheme. This is due on the one hand to the effect of the reduced load capacitance mentioned above, and on the other hand to the reduction of power consumption in the storage elements. Since the storage elements in each partition are guaranteed not to change their values in consecutive clock cycles, all the storage elements can be implemented with latches without affecting the behavior of the circuit.
A similar approach to reducing power consumption in storage elements using latches is found in [2]. Some flip-flops in the synthesized circuit are replaced with latches after high-level synthesis. Since latches are transparent while the clock is low (or high, for positive level-sensitive latches), the waveforms of the outputs of the storage elements can change after replacement. This approach therefore tries to find the flip-flops that do not change the behavior of the primary output ports when replaced with latches. Two basic conditions for a flip-flop to be safely replaced with a latch are that it should not be read and written in the same clock cycle and that it should not be connected to a primary output directly or through combinational logic. The first is because a storage element implemented with a latch can form a combinational loop while it is transparent, and the second is because the primary output can arrive half a cycle earlier after replacement than before. Sometimes it is difficult to find flip-flops that meet those conditions. In data-dominated applications many storage elements have self-loops, i.e., they can be read and written at the same time, and thus violate the first condition. In control-dominated circuits, an additional condition on the control data flow graph (CDFG) limits the chance of substitution. Another drawback of this scheme is that unnecessary signal propagation while the latches are transparent can cause additional power consumption in functional units. In control-dominated circuits the power consumed by the combinational logic that generates the next input values of the storage elements is not large, but in data-dominated circuits this power consumption cannot be ignored.
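The two basic replacement conditions described above can be expressed as a simple predicate. The Python sketch below is only an illustration: the function name `latch_replaceable`, the per-register data layout, and the two example registers are assumptions made here, and the additional CDFG condition mentioned for control-dominated circuits is not modeled.

```python
def latch_replaceable(read_steps, write_steps, reaches_primary_output):
    """Check the two basic conditions for replacing a flip-flop with a latch.

    read_steps             -- set of control steps in which the register is read
    write_steps            -- set of control steps in which the register is written
    reaches_primary_output -- True if the register drives a primary output
                              directly or through combinational logic only
    """
    read_and_written_together = bool(read_steps & write_steps)
    return not read_and_written_together and not reaches_primary_output


# Hypothetical registers: R_acc has a self-loop (read and written in step 2).
print(latch_replaceable({1, 2}, {2, 3}, False))  # False
print(latch_replaceable({1, 3}, {0, 2}, False))  # True
```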
In contrast with the above approaches, which try to reduce the power consumption in storage units, there are other power management schemes that minimize the power consumption in functional units at the expense of storage unit area. Among those, the perfect power management scheme in [3] shows the most noticeable results with negligible overhead. The key idea of perfect power management is to allocate storage units so that all functional units connected to each storage element generate only useful outputs. This is done by extending the lifetime of each variable until all the next input variables of the corresponding functional unit are available. This method can completely eliminate the spurious operations of all functional units, but its lifetime modification increases the number of registers. The area added by the extra registers is not critical, but the power consumption in storage elements is comparable to that of the combinational parts in many circuits, so there is further opportunity to reduce power consumption in the storage elements.
With all these in mind, we propose an allocation scheme that allows all the storage elements to be implemented with latches while minimizing the spurious operations in functional units, so that a low-power architecture can be achieved.
III. MOTIVATING EXAMPLE
The data flow graph in Fig. 1 will be used as an example in this section. It contains seven variables and three operations. Assume that scheduling and functional unit allocation are already done. The three operations will be executed in consecutive control steps (S1, S2, S3), and the number in each node indicates the functional unit that executes the operation. Fig. 2 shows the lifetime of each variable, the storage allocation result of the left-edge algorithm, and the timing diagram of the final circuit, for the original and two modified versions of the lifetimes. In Fig. 2(a), only two storage elements are required according to the left-edge algorithm. All the storage elements are implemented with positive edge-triggered flip-flops and none can be replaced with latches. For R1, which stores a, h, i, k, the destination variable and source variable of an operation are assigned to the same storage element (a and h for FU1, and h and i for FU2, are assigned to R1). For R2, which stores b, c, j, the storage element is written in consecutive control steps; replacing the flip-flop with a latch in this case causes the output value of the functional unit to change before that value is transferred to the storage element. In Fig. 2(b) the lifetimes are modified slightly so that latches can be used as storage elements. The death time of each variable is extended by one control step, so that the value is held for one more cycle after the cycle in which it was last used. The timing diagram shows the signal status when all storage elements are implemented with negative level-sensitive latches. Assuming that all the primary inputs and control signals are available from the negative edge of the clock signal, we can assert that all values will be available before the next negative edge of the clock, causing no glitch at the positive edge of the clock.
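The effect of extending death times on register count can be sketched with a small left-edge style allocator. The Python code below is a first-fit variant that processes variables in order of birth time; the function names and the numeric lifetimes are hypothetical stand-ins chosen for illustration, since the actual lifetimes of Fig. 2 are not reproduced in the text.

```python
def left_edge(lifetimes):
    """Pack variables into registers, processing them by increasing birth time.

    lifetimes maps a variable name to (birth, death); the variable occupies
    its register during control steps birth .. death-1 (half-open interval).
    Returns a list of registers, each a list of variable names.
    """
    order = sorted(lifetimes, key=lambda v: lifetimes[v])
    registers = []   # variables stored in each register
    free_at = []     # control step from which each register is free again
    for v in order:
        birth, death = lifetimes[v]
        for r, free in enumerate(free_at):
            if free <= birth:            # register r is free: reuse it
                registers[r].append(v)
                free_at[r] = death
                break
        else:                            # no free register: allocate a new one
            registers.append([v])
            free_at.append(death)
    return registers


def extend_death(lifetimes, extra=1):
    """Extend every death time by `extra` control steps."""
    return {v: (birth, death + extra) for v, (birth, death) in lifetimes.items()}


# Hypothetical lifetimes for the seven variables (not read from Fig. 2).
original = {"a": (0, 1), "b": (0, 1), "h": (1, 2), "c": (1, 2),
            "i": (2, 3), "j": (2, 3), "k": (3, 4)}

print(left_edge(original))                     # [['a', 'h', 'i', 'k'], ['b', 'c', 'j']]
print(len(left_edge(extend_death(original))))  # 4
```

With these assumed lifetimes the unmodified allocation needs two registers, and holding every value for one extra control step raises the count to four. The exact overhead depends on the real schedule.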
The shaded regions of the timing diagram represent spurious operations of functional units. In Fig. 2(a), the functional units evaluate 8 times in total, and 5 of these are unnecessary evaluations caused by unrelated variables sharing one register. Many of them are eliminated in Fig. 2(b) since sharing is avoided by extending the lifetimes of variables. By further extending the lifetimes of variables h and c, we can remove all spurious functional unit operations at the cost of one more storage element, as in Fig. 2(c).
As we can see in the example, with some modification of the lifetimes of variables, it is possible to synthesize circuits in which all storage elements can be implemented with latches and spurious operations of the functional units are minimized. In the next section, we describe how to modify the lifetimes to incorporate these features.
IV. PROPOSED METHOD
The condition for a storage element to be safely implemented with a latch is that it should not change its output value until one extra cycle after the death time of the variable it holds. Two contrasting situations are depicted in Fig. 3. Fig. 3(a) shows the case in which two variables (a and b) are allocated to one register so that the register changes its value as soon as the death time of the assigned variable arrives. The first row, labeled Reg, shows the output waveform of a register implemented with a flip-flop. In the first cycle the register holds variable a, and the functional unit FU1 begins evaluation at the positive edge of the clock; after some time its output stabilizes. In the next cycle, the FU1 output is transferred to another register x, and at the same time the register takes the new variable b since the death time of variable a has arrived. In this case the flip-flop cannot be replaced with a latch, since the latch may change its output before the relevant functional units transfer their results to the other storage elements. Fig. 3(b) shows the case in which the two variables are allocated to one register so that the register changes its value only one extra cycle after the death time of the assigned variable. In this case the storage element satisfies the replacement condition.
Every flip-flop can be replaced with a latch after the lifetime modification. The replacement reduces the overall circuit area and power consumption. However, for circuits in which the functional unit power consumption is comparable to that of the storage units, latch replacement can cause additional power consumption in functional units, since a latch output can change value twice in one clock cycle: once at the positive edge of the clock and once while the clock is low. This increase in functional unit power consumption can be kept to a minimum with the help of the technique in [3]. The key idea is to allocate storage units so that all functional units connected to each storage element generate only useful outputs. By further extending the lifetime of each variable to meet the following condition, the spurious operations in functional units can be minimized.
V. EXPERIMENTS & RESULTS
We evaluated our scheme on two HLSynth92 benchmarks [4], DIFFEQ and ELLIPF. DIFFEQ is a differential equation solver example and ELLIPF is an elliptical filter example. These benchmarks are also supplied with suitable test vectors. Fig. 4 shows the procedure used in our experiment. In the first step, the input behavior, described in a simple VHDL-like code, is read in and a data flow graph (DFG) is constructed. In the next steps, given the resource constraints, scheduling is done with the ASAP algorithm and the functional units are allocated in order of appearance in the source description. Since the proposed scheme concerns only the storage allocation step of high-level synthesis, we did not put much effort into the scheduling and functional unit allocation steps. In the fourth and fifth steps, the lifetime of each variable is analyzed and modified with the equations in the previous section. With the modified lifetimes, the minimum number of storage units is allocated using the left-edge algorithm. Finally, an RTL description in Verilog HDL is generated using latches, multiplexers and functional units such as adders and multipliers as building blocks. The generated RTL code is synthesized using Synopsys Design Compiler for a 0.65um standard cell library. After gate-level simulation of the synthesized design, the power consumption is estimated using Epic PowerMill. The test vectors supplied with the benchmarks are used for the gate-level simulation and power estimation. The last three circuits are the ELLIPF example with two, three and four adders. The bars tagged 'Original' represent the circuit with a simple clock-gating power management scheme. The bars tagged 'PPM' represent the circuit with the perfect power management scheme applied. All the storage elements in Original and PPM are implemented with flip-flops.
In the third and fourth bars, tagged 'Latch' and 'PPM+Latch', the storage elements are implemented with latches by applying Equation (1) and Equation (2), respectively. In the DIFFEQ example, the three low-power schemes progressively decrease power consumption. But in the ELLIPF example, where storage power consumption is dominant as we can see in Table 1, the PPM scheme increases power consumption because it uses more storage elements than the original circuit. In all cases, the circuit implemented with latches consumes less power than the circuits with the other power management schemes. Fig. 6 shows the areas of the synthesized circuits, including the estimated interconnection area. Since the number of storage elements increases when we use latches, the overall area is not reduced much. When storage elements are more dominant than the functional units, as in the ELLIPF example, the area may even increase.
VI. CONCLUSION
In this paper, we proposed a storage allocation method that enables storage elements to be implemented with latches. In general, a latch implementation of a circuit requires a two-phase clocking scheme to correctly pass variables between storage elements that operate in consecutive cycles. In the proposed scheme, all the storage elements can be implemented with latches using a single clock scheme, since the lifetimes of variables are modified to guarantee that each variable remains valid until it is safely passed to the next storage element. Since only the lifetimes of the variables are slightly modified, the proposed scheme can be applied to any high-level synthesis system.
According to our experiments, the proposed scheme is helpful both when the storage unit power is dominant and when the functional unit power is dominant, in contrast to perfect power management, which is useful only when functional unit power is dominant. The power consumption of the proposed latch-based low-power scheme is reduced to 39% to 65% of the power consumption of the circuit with simple power management by clock gating.
Table 1 AREA AND POWER CONSUMPTION OF BENCHMARK CIRCUITS FOR EACH LOW POWER SCHEME
