Abstract-Specification of a concurrent system using Concurrent Action Oriented Specifications (CAOS), as illustrated by Bluespec Inc.'s Bluespec System Verilog, provides a high abstraction level, effective concurrency management through atomicity, and powerful compilation to efficient RTL hardware. In this paper, we present two algorithms that make CAOS to RTL synthesis power-aware and produce RTL that can be synthesized into hardware competitive in terms of power/area/slack trade-off against the well-known industrial-strength power optimization RTL to gate-level synthesis tools. Our algorithms are simple and intuitive because the higher abstraction level makes certain power-saving opportunities that exist in the model easy to analyze. Discovering these opportunities at the RTL and lower levels requires a much more involved circuit or gate structure analysis. We show through extensive experimental results that when a CAOS specification is compiled using our algorithms, the resulting hardware (without any additional gate-level power optimizations) has power/area/latency numbers comparable to those obtained by using existing tools for applying gate-level power minimization techniques. The experiments also show that in the absence of gate-level power optimizers such as Magma Blast Power or Synopsys Power Compiler, these algorithms achieve significant power reduction over the standard Bluespec Compiler (BSC) for CAOS to RTL generation. Most importantly, our algorithms allow analyzing the effects of various power saving techniques in the early phases of the design cycle, thus avoiding the need to perform logic synthesis for such an analysis.
I. INTRODUCTION
The increase in the complexity and size of hardware designs, and the resulting surge in their power demand, has made power consumption a critical measure for the success of any design methodology. Recently, high-level synthesis has shown promise in generating efficient hardware designs and has become an area of renewed interest in the research community as well as industry. This paper presents techniques for the reduction of dynamic power in designs generated using high-level synthesis from specifications at a behavioral abstraction called Concurrent Action Oriented Specifications (CAOS).
A large amount of dynamic power is consumed in the registers and the clocks of a design. Clock-gating of registers is a commonly used technique to reduce such power. In this paper, we present an algorithm that exploits the CAOS model of computation for efficient generation of gated-clocks during synthesis. This algorithm is general enough to apply to multiple clock domain specifications as well.
Another well-known technique to minimize the dynamic power of a hardware design is to decrease the switching activity at the inputs of its functional units. During high-level synthesis, such a decrease in the switching activity can be targeted during the scheduling, allocation or binding phases of the synthesis process. However, even when the switching activity of various signals within a design is at its minimum, there is always a possibility of some unnecessary computation occurring in the design which can lead to unwanted power dissipation. A combinational computation is deemed unnecessary for a clock cycle if its output is not used for any useful purpose in that clock cycle.
Example 1
In CAOS, a design is expressed in terms of guarded atomic actions at a level of abstraction above RTL. Each action consists of two parts: a guard and a body. An action will be executed, that is, the output computed in its body will be used to update the state of the design, if its guard evaluates to true in a clock cycle. For example, a CAOS-based description of a GCD (Greatest Common Divisor) design can be written in terms of actions Swap and Diff as shown in Figure I.
Action Swap : g_1 ≡ ((x > y) && (y ≠ 0))
    x <= y; y <= x;
Action Diff : g_2 ≡ ((x ≤ y) && (y ≠ 0))
    y <= y − x;
Figure I. CAOS description of GCD design.
The execution semantics of the design shown in Figure I are as follows: g_1 and g_2 are the guards of actions Swap and Diff respectively (x and y are the registers). The swap of the values in the body of action Swap occurs only when g_1 evaluates to true. The subtraction operation y − x in the body of action Diff occurs whenever the values of x and/or y change, but the assignment y <= y − x occurs only when g_2 evaluates to true. Hence, when g_2 evaluates to false, the combinational logic corresponding to the subtraction operation is involved in unnecessary computation.
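The behavior described above can be made concrete with a small cycle-by-cycle simulation. The sketch below is only an illustration: the function name `simulate_gcd` and the counter `wasted_subtractions` are hypothetical, the guard condition on y is read as y ≠ 0, and one loop iteration stands for one clock cycle in which exactly one of the two actions fires.

```python
def simulate_gcd(x, y):
    """Run the Swap/Diff actions of the GCD example, one clock cycle per iteration.

    Returns (result, cycles, wasted_subtractions), where wasted_subtractions
    counts cycles in which y - x was computed but not used to update y.
    """
    cycles = 0
    wasted_subtractions = 0
    while True:
        g1 = (x > y) and (y != 0)    # guard of action Swap
        g2 = (x <= y) and (y != 0)   # guard of action Diff
        if not (g1 or g2):
            break                    # no guard is true: execution stops
        diff = y - x                 # the subtractor evaluates in every cycle
        if g2:
            y = diff                 # Diff fires: the subtraction result is used
        else:
            x, y = y, x              # Swap fires: the subtraction result is discarded
            wasted_subtractions += 1
        cycles += 1
    return x, cycles, wasted_subtractions


print(simulate_gcd(6, 4))  # (2, 5, 2)
```

For the inputs 6 and 4, two of the five cycles compute a difference that is never stored, which is the kind of activity the gating logic discussed later aims to suppress.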
As illustrated in Example 1, for a CAOS-based design, computations occurring in the bodies of the actions whose guards evaluate to false should be avoided for the purposes of power savings. So far, synthesis engines such as BSC [1] do not exploit this. An algorithm for efficiently blocking the switching activity to such unused parts of a design is also presented in this paper. As explained later in the paper, synthesis using CAOS provides an efficient way of selecting the activation signal to block these unnecessary computations in a design.
The techniques demonstrate that decisions about the implementation of low-power techniques can be taken more efficiently during high-level synthesis than at the lower levels of abstraction (RTL and below). By using our techniques during CAOS-based synthesis, the effects of various low-power techniques on different architectures of a design can be estimated much earlier in the design cycle (at RTL), and an appropriate architecture and low-power technique can be selected, leading to an increase in overall productivity. This enhances the achieved power savings and aids in easier and faster architectural exploration (instead of going through the whole power estimation flow up to the gate level for each architectural choice). Furthermore, since such low-power techniques [2] are very commonly used in most real hardware designs, implementing them during high-level synthesis generates designs which represent real designs more closely than designs generated without these low-power techniques.
The paper is organized as follows. Section II discusses related work, which is followed by a short description of CAOS in Section III. In Section IV, an algorithm for efficient clock-gating of registers during synthesis from CAOS is presented. Section V describes another algorithm targeting the reduction of unnecessary computations in a design by automatic insertion of gating logic. Experimental results obtained by implementing these algorithms in BSC and applying them to some realistic designs are presented in Section VI. Section VII summarizes the paper and discusses future work.
II. RELATED WORK
A comprehensive high-level synthesis system for reducing power consumption in control-flow intensive as well as data-dominated circuits is presented in [3]. [4] presents a power management technique targeted towards high-level synthesis of data-dominated behavioral descriptions. A framework for reduction of energy as well as transient power components during behavioral synthesis is presented in [5]. [6] discusses transient power management through the choice of appropriate architectures during high-level synthesis. In [7], a scheduling algorithm which aims to maximize the idle times for functional units is proposed.
The above-mentioned low-power techniques use CDFG-based models which inherently sequentialize parts of the computation of a design in the form of computation threads. The work presented in this paper is based on the use of CAOS for high-level synthesis. An advantage of CAOS is that it does not lose the parallelism/concurrency inherent in the specifications and often allows the synthesis mechanism to infer more parallelism.
In [8] , [9] , other algorithms for low-power hardware synthesis from CAOS are described. Those algorithms are based on re-scheduling of the actions of a design and target the reduction of dynamic power and peak power components of a design.
Clock-gating of registers is a widely used technique for dynamic power savings. Another well-known power optimization technique is Operand Isolation (also known as signal gating), which avoids unnecessary computations in a design by gating its signals in order to block the propagation of switching activity through the circuit. [10] discusses clock gating in detail and how it can be helpful in low-power VLSI design. Approaches to reduce clock power based on RTL clock-gating are discussed in [11], [12]. [13] discusses automation of operand isolation during ADL (Architecture Description Language)-based RTL generation of embedded processors. In [14], a model is described to estimate the power savings that can be obtained by isolation of selected modules at RTL. In this paper, we extend the use of such techniques to the CAOS level of abstraction, particularly to their efficient application in designs generated during high-level synthesis from CAOS.
III. CAOS
In CAOS, a design is described in terms of guarded actions such that each action consists of two parts: a guard and a body. The guard is a condition associated with an action which should evaluate to true for that action to execute; that is, the value computed in the body of an action is used to update the state of the design only if its guard evaluates to true. The body of an action operates on the state of the system. In the CAOS model of computation, the designer explicitly instantiates all the state elements of the system (such as registers, FIFOs and memories). This model then undergoes synthesis to generate the RTL code [15], [16], [17]. CAOS is behaviorally higher in abstraction because of its handling of concurrency and synchronization of updates on the shared states of a design via atomic actions. BSC [1] is based on CAOS.
An action a in the concurrent action-oriented specifications of a design can be written as,
Action a : g(s)
    s_1 = b_1(s, t); s_2 = b_2(s, t); s_3 = b_3(s, t);
Here, s is the set of state elements of the design such that {s_1, s_2, s_3} ⊆ s; g(s) is the guard of action a. The body of the action contains three statements of the form s_j = b_j(s, t), where b_j(s, t) is an expression which computes the next state of the system using current state s and current input t. Guard g(s) is also an expression which evaluates to either true or false.
For the GCD design shown in Figure I, the set of state elements can be denoted as s = {x, y}. Guards g_1 and g_2 are expressions which can be expressed in terms of other expressions as shown in Figure II.
g_2 = e_2 && e_3;
Figure II. Expressions used in GCD Design.
An expression can denote complicated operations on the state of a design (in which case it can be composed of one or more other expressions) or it can be as simple as reading the value of a state element, for example, e_4 = x.read() in Figure II.
The actions in CAOS are atomic in the sense that either all the computations corresponding to the body of an action finish successfully, or none of them executes. Two actions are said to be in conflict with each other if they update one or more of the same state elements. In the designs generated from CAOS, multiple actions can execute in each clock cycle as long as the actions do not conflict among themselves and the behavior corresponding to their concurrent execution corresponds to at least one sequential ordering of the execution of these actions [16], [15]. Recognition of this concurrent execution of the actions of a design by the compiler decreases the latency of the design. The system execution stops when no guard evaluates to true.
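The conflict rule can be illustrated with a short sketch. Everything named here (`Action`, `in_conflict`, `select_actions`, and the extra action `Tick`) is hypothetical and introduced only for illustration. The sketch checks conflicts purely through overlapping write sets and picks actions greedily in list order; it does not check the additional requirement that the concurrent execution match some sequential ordering, and it is not a description of how BSC schedules actions.

```python
from typing import Callable, Dict, FrozenSet, List, NamedTuple

State = Dict[str, int]


class Action(NamedTuple):
    name: str
    guard: Callable[[State], bool]   # evaluated on the current state
    writes: FrozenSet[str]           # state elements updated by the body


def in_conflict(a: Action, b: Action) -> bool:
    """Two actions conflict if they update at least one common state element."""
    return bool(a.writes & b.writes)


def select_actions(actions: List[Action], state: State) -> List[str]:
    """Greedily pick enabled actions that do not conflict with those already chosen."""
    chosen: List[Action] = []
    for act in actions:
        if act.guard(state) and not any(in_conflict(act, c) for c in chosen):
            chosen.append(act)
    return [act.name for act in chosen]


swap = Action("Swap", lambda s: s["x"] > s["y"] and s["y"] != 0, frozenset({"x", "y"}))
diff = Action("Diff", lambda s: s["x"] <= s["y"] and s["y"] != 0, frozenset({"y"}))
tick = Action("Tick", lambda s: True, frozenset({"n"}))  # hypothetical extra action

print(select_actions([swap, diff, tick], {"x": 6, "y": 4, "n": 0}))  # ['Swap', 'Tick']
```

Swap and Diff both update y and so can never fire together, while the hypothetical Tick touches a disjoint state element and can run in the same cycle as either of them.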
Hardware synthesis from CAOS can be achieved by implementing each expression used in the composition of a guard or a body as combinational logic. A control circuit that picks a maximum number of actions to be executed concurrently in each clock cycle is also implemented. Figure III shows the translation from the actions into the hardware. [16] shows that BSC can generate hardware as efficient as hand-coded RTL Verilog.
IV. CLOCK GATING OF REGISTERS
As already mentioned, clock-gating of registers is commonly used at the RTL and lower levels of abstraction for reducing the Register/Clock Power of a design. Let us consider a design described using guarded atomic actions with the following notations:
gatedClock(clk, en, rst): Function which returns a gated clock (generated using a latch and an AND gate). It takes clock clk, enable signal en and reset signal rst as the inputs.
C: Set of subsets of R (the registers of the design) where each subset C_i contains registers having the same clock.
E: Set of subsets of C_i ∈ C where each subset E_j contains registers having the same enable signal.
T: Set of subsets of E_j ∈ E where each subset T_k contains registers having the same reset.
Algorithm 1 performs efficient clock-gating of registers for CAOS-based designs. In CAOS, the enable signal of a register is the disjunction of the guards of all the actions which can update it, so registers which are updated by the same set of actions have the same enable signal. During clock-gating of registers, the same gated clock can be passed to registers having a common enable signal (assuming the same clock and reset signal). Thus in CAOS-based designs, the guards of the actions provide an efficient way of selecting which registers should share the gated clocks. Based on this idea, Algorithm 1 generates and assigns a gated clock to each register of a design. It efficiently handles designs with multiple clocks and reset signals. In designs with a single clock and a single reset signal (most designs fall in this category), gated clocks will be assigned to the registers based on their enable signals, and hence the total number of generated gated clocks will be equal to the number of distinct enable signals.
for all registers r ∈ R do
    Compute enable signal EN_r;
end for
Compute C = {C_i : C_i is a group of registers having same clock CLK};
for all C_i such that C_i ∈ C do
    Compute set E = {E_j : E_j is a group of registers having same enable signal EN};
    for all E_j such that E_j ∈ E do
        Compute set T = {T_k : T_k is a group of registers having same reset signal RS};
        for all T_k such that T_k ∈ T do
            gCLK = gatedClock(CLK, EN, RS);
            replace clocks of all the registers in T_k by gated clock gCLK;
        end for
    end for
end for
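The grouping performed by Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the BSC implementation: the function name `assign_gated_clocks`, the dictionary layout, and the extra register `z` are hypothetical, and an enable signal is represented simply as the set of guard names whose disjunction it stands for. Grouping by the triple (clock, enable, reset) yields the same partition as the nested grouping by clock, then enable, then reset.

```python
from collections import defaultdict


def assign_gated_clocks(registers, actions):
    """Group registers that can share one gated clock.

    registers: register name -> (clock name, reset name)
    actions:   action name   -> (guard name, set of registers the action updates)
    Returns a dict mapping (clock, enable, reset) -> list of registers, where
    enable is the frozenset of guards whose disjunction enables the register.
    """
    # Enable of a register: disjunction of the guards of all actions updating it.
    enable = {
        r: frozenset(guard for guard, writes in actions.values() if r in writes)
        for r in registers
    }
    groups = defaultdict(list)
    for r, (clk, rst) in registers.items():
        groups[(clk, enable[r], rst)].append(r)   # one gated clock per key
    return dict(groups)


# GCD registers x and y plus a hypothetical register z updated only by Swap.
registers = {"x": ("CLK", "RST"), "y": ("CLK", "RST"), "z": ("CLK", "RST")}
actions = {"Swap": ("g1", {"x", "y", "z"}), "Diff": ("g2", {"y"})}

groups = assign_gated_clocks(registers, actions)
for (clk, en, rst), regs in groups.items():
    print(clk, sorted(en), rst, regs)
```

With a single clock and reset, the number of groups equals the number of distinct enable signals: here x and z share one gated clock (enable g1) and y gets another (enable g1 || g2).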
In CAOS, the values of various guards of the design are computed in each clock cycle and are used to select the actions which can be executed in that clock cycle. Thus the combinational logic corresponding to a guard g i is involved in useful computation in every clock cycle. On the other hand, expressions in the bodies of various actions also compute values in each clock cycle but only some of those values (having corresponding g i evaluate to true) are selected to update the state of the design. The computations corresponding to the unselected values can be considered as unnecessary computations since those values are not used to compute the next state. Avoiding such computations will result in reduction of the switching activity of the design leading to dynamic power savings.
As mentioned earlier, the guards and the bodies of the actions are composed of one or more expressions. Algorithm 2 parses the expressions corresponding to the bodies of various actions and inserts gating logic (using an AND gate or a LATCH) at the appropriate places in these expressions. Each expression gets translated into combinational logic during synthesis. The gating logic is inserted such that the inputs of these combinational blocks are isolated/gated using the guard of the corresponding action as the activation signal of the gate. Thus computations are triggered across a combinational block only when its output is used in some further computation or to update the state of the design. For efficient gating of various signals of a design, two problems need to be solved:
1) Insertion of gates at the appropriate points: Algorithm 2 targets this problem by parsing through the various expressions used in the bodies of the actions of a design and inserting gates such that the unnecessary computations are minimized. While inserting the gates, expressions used in the guards are not affected, since outputs of such expressions are involved in useful computation in each clock cycle. Sharing of common expressions among various actions is also taken into account while inserting the gating logic.
2) Selection of activation signal: The guard of each action can be used to decide if the computation occurring within a combinational block will be used in a clock cycle. Thus for a design described using CAOS, the activation signals required for the gating logic are already exposed in the form of these guards, which makes the implementation of gating logic efficient since no separate circuit is required for the generation of these activation signals.
Consider the following notations:
G: Set of expressions corresponding to guards of the actions of the design (g_i ∈ G).
B_i: Set of expressions (including the ones involved in the composition of other expressions) used in the body of an action a_i ∈ A.
E_g: Set of expressions (including the ones involved in the composition of other expressions) used in various guards of the design.
For the GCD design shown in Figure I, for example, G = {g_1, g_2}.
Let us define the following functions:
subExprs(e): Function that returns the set of expressions that are used to compose expression e. Such expressions can also be considered as the inputs to expression e.
isValueRead(e): Function that returns true if expression e represents an access (reading a value) to a memory element.
rank(e): Function that returns the number of actions that share an expression e. It can be defined as rank(e) = |{a_i ∈ A : e ∈ B_i}|, that is, the number of actions whose body uses e.
gate(e, g_i): Function that inserts the gating logic. It returns a new expression e' which evaluates to e when g_i is true, or evaluates to zero otherwise. Such an expression can be composed using any of the following definitions:
1. Using AND gate (e' = e && g_i): Gating using AND gates is mainly suitable for designs where guard g_i does not change frequently; that is, actions of the design do not execute frequently. This is because an AND gate will change its output when g_i transitions from high to low, thus triggering some unnecessary computation in the combinational logic during such a transition.
2. Using LATCH (e' = output of a latch having e as its input and g_i as its enable): Gating using latches is suitable for designs where actions execute frequently. This is because a latch holds its output value while its enable signal is low, so no transition occurs at its output when g_i changes from high to low.
isolate(e, i): Function that returns a new expression after inserting appropriate gating logic in expression e. It uses the guard of action a_i as the activation signal for the gating logic.
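The difference between the two gating styles can be seen by counting output transitions over a few clock cycles. The sketch below is a behavioral illustration only; the function names `and_gate`, `latch_gate` and `transitions`, and the sample waveforms, are hypothetical and do not model real gate delays or glitches.

```python
def and_gate(values, guards):
    """AND-style gating: pass the value when the guard is high, output zero otherwise."""
    return [v if g else 0 for v, g in zip(values, guards)]


def latch_gate(values, guards, initial=0):
    """Latch-style gating: transparent when the guard is high, hold the last value otherwise."""
    out, held = [], initial
    for v, g in zip(values, guards):
        if g:
            held = v
        out.append(held)
    return out


def transitions(seq):
    """Number of cycle-to-cycle changes, a rough proxy for switching activity."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a != b)


# Guard toggles every cycle while the data input stays constant.
values = [5, 5, 5, 5, 5, 5]
guards = [1, 0, 1, 0, 1, 0]
print(transitions(and_gate(values, guards)))    # 5
print(transitions(latch_gate(values, guards)))  # 0

# Guard is high once and then stays low while the data input keeps changing.
values = [5, 7, 9, 3, 8, 2]
guards = [1, 0, 0, 0, 0, 0]
print(transitions(values))                      # 5 (ungated input)
print(transitions(and_gate(values, guards)))    # 1
```

In the first waveform the AND gate itself injects a transition each time the guard falls or rises, while the latch output stays still; in the second, where the guard rarely changes, the AND gate blocks almost all of the input activity at lower cost.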
The algorithm starts by selecting an action a_i and parsing through each expression used in its body. For each such expression e, the algorithm makes a call to function isolate(e, i), which handles the following cases:
1) Expression e is also used in composing at least one guard (e ∈ E_g): In CAOS, each guard g_i ∈ G is involved in useful computation in every clock cycle. Since e ∈ E_g, its output will also be used in each clock cycle. Thus no further parsing of e is required; that is, the expressions which are used to compose e need not be parsed. In this case, a new expression e' = gate(e, g_i) which incorporates the gating logic is created. Then, e is replaced by e' in the body B_i of the selected action a_i. This makes sure that the guards of the actions are not affected by the insertion of the gating logic, since they will continue to use the output of the original expression e for proper evaluation.
//Computes all expressions used for composing e.
function allSubExprs(e)
    S = subExprs(e);
    T = S;
    for all e' such that e' ∈ S do
        X = allSubExprs(e');
        T = T ∪ X;
    end for
    return T;

In Case 3 (rank(e) > 1), if isValueRead(e) returns true, a new expression e' = gate(e, g_i) is created. This way, the other action(s) which are also using expression e will not be affected. On the other hand, if isValueRead(e) returns false, then a new expression e' = e (a copy of e) is created. The idea is to avoid sharing and create separate combinational logic for the body of the selected action a_i which can be gated independently without affecting other actions. In most cases, such duplication of combinational logic increases the potential of gating these blocks independently of each other, thus leading to higher power savings. After creating e', expression e is replaced by e' in B_i and the expressions which are used to compose expression e' are parsed further.
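The case analysis performed by isolate can be sketched as a recursive traversal of an expression tree. This is a hedged reading of the three cases, not the authors' implementation: the `Expr` class, its fields, and the primed naming of new expressions are hypothetical, guards are identified by name, and the handling of Case 2 when the expression is not a value read (keep the expression in place and continue into its sub-expressions) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Expr:
    name: str
    subs: List["Expr"] = field(default_factory=list)  # subExprs(e)
    is_read: bool = False                              # isValueRead(e)
    gated_by: str = ""                                 # activation guard, "" if ungated


def gate(e: Expr, guard: str) -> Expr:
    """Create a new expression e' that is e gated by the given guard."""
    return Expr(e.name + "'", e.subs, e.is_read, gated_by=guard)


def isolate(e: Expr, guard: str, guard_exprs: Set[str], rank: Dict[str, int]) -> Expr:
    """Return the expression that replaces e in the body of the action with this guard."""
    if e.name in guard_exprs:
        # Case 1: e also feeds a guard; gate a new e' and stop parsing.
        return gate(e, guard)
    if e.is_read:
        # Cases 2 and 3 with isValueRead(e) true: gate the read and stop parsing.
        return gate(e, guard)
    # Not a value read: continue into the sub-expressions.
    new_subs = [isolate(s, guard, guard_exprs, rank) for s in e.subs]
    if rank[e.name] == 1:
        # Case 2 (assumption): e belongs to this action only, so reuse it in place.
        return Expr(e.name, new_subs)
    # Case 3: e is shared, so build a private copy e' for this action.
    return Expr(e.name + "'", new_subs)


# Body of action Diff in the GCD design: y - x, built from two register reads
# that are also used by the guards.
x_read = Expr("x.read", is_read=True)
y_read = Expr("y.read", is_read=True)
sub = Expr("y - x", [y_read, x_read])

result = isolate(sub, "g2", {"x.read", "y.read"}, {"y - x": 1})
print(result.name, [(s.name, s.gated_by) for s in result.subs])
```

In this example the subtractor itself is left ungated, but both of its inputs are replaced by copies gated with g2, so the subtraction only sees new operand values in cycles where Diff fires, while the guards keep reading the original ungated x.read and y.read.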
A. Other Versions of Algorithm 2
1) Version 2: Case 2 of Algorithm 2 occurs when expression e is used in the body of only one action; that is, rank(e) == 1. As mentioned earlier, if isValueRead(e) returns true, then such an expression e is gated without any further parsing of the expressions used in its composition. A different implementation of Algorithm 2 can be obtained by gating e and then continuing to parse the expressions used in its composition in order to look for more opportunities for isolation. Such opportunities may arise when a value is read from a memory element based on the value of some other argument, in which case the expressions corresponding to both these values can be isolated.
2) Version 3: Another possible implementation of Algorithm 2 can be obtained by modifying its Case 3. As mentioned earlier, Case 3 occurs when expression e is used by the body of at least one more action; that is, rank(e) > 1. Let A' be the set of all such actions using e. Let G' be the set of expressions corresponding to the guards of all the actions in A'.
In Case 3, expression e' is used to replace expression e as part of isolation. When isValueRead(e) returns true, e' is computed as e' = gate(e, g_i); otherwise e' = e. Instead, in both these cases, expression e' can be evaluated using a new activation signal a which can be given as,
a = g'_1 || g'_2 || ... || g'_n, where G' = {g'_1, g'_2, ..., g'_n}.
Thus, instead of using the guard of a single action as the activation signal to evaluate e', the disjunction of all the expressions in G' (corresponding to the guards of all the actions using expression e) can be used as the activation signal. For example, in case AND gates are used for isolation, expression e' can be created as e' = e && a.
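The Version 3 activation signal can be sketched as follows. The function name `activation_signal`, the dictionary layout, and the expression and action names are hypothetical and serve only to illustrate taking the disjunction of the guards of all actions that use a shared expression.

```python
def activation_signal(expr, bodies, guard_values):
    """Disjunction of the guards of all actions whose body uses expr.

    bodies:       action name -> set of expression names used in its body (B_i)
    guard_values: action name -> value of that action's guard in this cycle
    """
    users = [a for a, exprs in bodies.items() if expr in exprs]   # the set A'
    return any(guard_values[a] for a in users)                    # OR over G'


bodies = {"a1": {"e5", "e6"}, "a2": {"e5"}, "a3": {"e7"}}

# e5 is shared by a1 and a2, so it is active whenever either guard is true.
print(activation_signal("e5", bodies, {"a1": False, "a2": True, "a3": False}))  # True
print(activation_signal("e5", bodies, {"a1": False, "a2": False, "a3": True}))  # False
```

With this signal a shared expression needs only one gated instance, at the price of computing in cycles where any of its users fires rather than only the selected action.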
VI. EXPERIMENTS AND RESULTS
We implemented the proposed algorithms in Bluespec Compiler (BSC) and tested them on various realistic designs. The selected designs vary in their nature, size and complexity. Each design is first synthesized to RTL using BSC which converts a CAOS-based description of a design into RTL Verilog code. (version X-2005.06-9) to verify the functional behavior of the designs. The synthesized designs are checked to make sure that they meet the timing requirements. Power estimation is done at both RTL and gate-level for which the generated Verilog design files and the simulation activity files (in value change dump (vcd) format) are passed to Sequence PowerTheater (version R2006.1) [18] . Both RTL and gate-level experimental results for Algorithm 1 and Algorithm 2 are presented below.
A. Algorithm 1
Dynamic power of a design is composed of its Combinational Power, Register Power and Clock Power. Table I and Table III show the reductions (as fractional change from original power) obtained in Total Power and Register+Clock Power using Algorithm 1. The numbers shown are obtained by performing power estimation at the gate-level. Table I shows the results obtained when Blast Create is used as the logic synthesis tool. Since clock-gating of registers can also be handled efficiently by logic synthesis tools, in Table I we compare the power savings achieved using Algorithm 1 against the savings obtained by turning on Blast Create's clock-gating feature.
Insertion of the extra clock-gating circuitry for power savings is associated with a corresponding increase in the area of a design. Table II shows the area penalties (as fractional change compared to the area of the original design) reported by Blast Create on using Algorithm 1 for power savings. Corresponding power and area numbers when Power Compiler is used as the logic synthesis tool are shown in Table III and Table IV.
As shown in Table I and Table III, Algorithm 1 consistently showed significant power savings in all the designs. Larger power savings were obtained for the AES and DMA designs, which consist of several registers that are not updated frequently, thus saving significant power by clock-gating of registers.
Comparison of the results shows that Algorithm 1 is competitive in the sense that for most designs the power saved by using Algorithm 1 is very close to the savings achieved by using Blast Create's or Power Compiler's clock-gating. For some designs like Vending Machine (VM) (Table I, Table III), Greatest Common Divisor (GCD) (Table I) and FSM (Table III), Algorithm 1 even performs better than the logic synthesis tools. This can be attributed to the fact that CAOS, which is at a higher level of abstraction than RTL, can facilitate efficient decisions during the application of low-power techniques.
B. Algorithm 2
Table V to Table VIII show the reductions obtained in Total Power and Combinational Power of various designs using Algorithm 2 (and its versions) along with the associated effects on the area of those designs. The reported numbers (gate-level) are obtained with AND gates used as the gating logic in order to minimize the power and area overheads associated with the extra circuit inserted by Algorithm 2. For each design, all power and area results are shown as fractional change compared to the original design.
Logic Synthesis Using Blast Create: In Table V we show the power savings achieved using Algorithm 2 and its versions when Blast Create is used for logic synthesis. Total power savings of up to 25% (AES design) on using Version 2 of the algorithm demonstrate that Algorithm 2 can be successfully used to generate power-efficient designs. The AES design, which is an implementation of the Advanced Encryption Standard (AES) algorithm, consists of 11 actions, only some of which were executing in each clock cycle. Thus, Algorithm 2 showed significant power savings for the AES design. Other designs like DMA, FSM and Vending Machine (VM) also show a decrease in the total power consumption on using Algorithm 2, as shown in Table V. Note that in most cases, Version 1 of Algorithm 2 shows maximum power savings. But for the AES design, Version 2 performs even better than Version 1. This is because, as explained earlier, Version 2 of the algorithm looks for extra opportunities for isolation by further tracing the expressions involved in accessing the value of a memory element. Thus, depending on the design, either of these versions can be used for power savings.
On the other hand, as shown in Table V, for the UC design (an implementation of a Bus Upsize Converter) we noticed an increase in power consumption on using Algorithm 2. Further analysis showed that for designs in which most actions execute frequently, using Algorithm 2 may increase their power demand. This is because the inserted gating logic also consumes some additional power, and if the combinational power saved is less than this extra overhead (due to frequent execution of most of the combinational logic), then the overall power of the design will increase. We noticed that for most CAOS-based designs, an action which can be successfully gated (an action is said to be successfully gated in a clock cycle if its guard evaluates to false in that cycle) for more than two consecutive clock cycles will contribute to power savings on using Algorithm 2. On the other hand, an action which executes frequently (and thus cannot be gated for a large number of consecutive cycles) may result in an increase in the design's power consumption.
Power savings obtained by using Algorithm 2 are also associated with a corresponding increase in the area of a design due to the insertion of extra gating logic. Table VI reports the associated area penalties for each design on using Algorithm 2 (the numbers are obtained from the area reports generated by Blast Create). Maximum area penalties are seen for the FSM design. Since the chosen FSM consisted of a large combinational part, using Algorithm 2 for power savings resulted in the insertion of significant gating logic, thus increasing the area of the design. Hence, application of Algorithm 2 involves a power-area trade-off. Addition of extra gating logic for the purposes of power savings also affects the critical path slack for a design, thus affecting its performance. This is because the computation corresponding to the guard and the body of an action are forced to occur sequentially (as opposed to concurrent execution in the original design) due to the added gating logic. Thus, on using Algorithm 2 the slack of a design should usually shrink. However, we noticed that for some designs Algorithm 2 actually resulted in some slack improvement. This can be attributed to the fact that the addition of extra AND gates enables some additional Boolean optimizations during logic synthesis of these designs. Such optimizations may also result in slight area reduction in some cases.
Logic Synthesis Using Power Compiler: Table VII and Table VIII show the corresponding power and area numbers when Power Compiler is used for logic synthesis. As shown in Table VII, power savings achieved using Algorithm 2 and its versions are comparable to the savings obtained using Power Compiler's operand isolation. Version 2 of the algorithm saves maximum power for the AES design, whereas for the FSM design the original Algorithm 2 results in the most power savings. Version 3 performed slightly better than the others in the case of the UC design. The results show that for most designs either Algorithm 2 or its Version 2 provides better power savings. Thus, depending on the design, an appropriate version of Algorithm 2 can be used.
Note that, as shown in Table VII, the power consumption of some designs remains almost the same as that of the original design on using Algorithm 2 or Power Compiler's operand isolation. We need to develop a better understanding of Power Compiler's synthesis process to reason about this behavior.
Another Refinement: As mentioned earlier, in Algorithm 2 guards of various actions are used in the gating logic (as the activation signals) for isolating a part of the design. In the real hardware, any unnecessary switching activity occurring in the guards of a design (before the signal settles down to a value) will result in extra power consumption. In order to avoid the propagation of such switching occurrences in various guards, their values can be passed to the gating logic only at the negative edge of the clock.
We implemented such a refinement of Algorithm 2 using latches which pass the values of the guards to the gating logic only at the negative edge of the clock. We noticed that for some designs such a refinement helps to further decrease the power of the design at the cost of extra area overhead incurred by the addition of latches. Also, such a use of latches results in increasing the clock period of a design since the values of guards are only passed in the second half of the clock cycle, thus leaving less time for various computations to complete.
C. RTL Power Estimation
Instead of performing logic synthesis for analyzing the power consumed by a design, PowerTheater can be used for RTL power estimation so that the effects of various low-power techniques can be evaluated earlier in the design cycle. This aids in faster architectural exploration. Table IX and Table X show the power savings achieved using Algorithm 1 and Algorithm 2 respectively when power estimation is done at the RTL.
Comparison of the RTL power numbers against the gate-level numbers shows that RTL power estimation can be successfully used to analyze the effects of Algorithm 1 and Algorithm 2 on the power consumption of most designs. For some designs like AES (Table I), FSM (Table I) and GCD (Table III) there were significant differences in the absolute RTL and gate-level power numbers. However, as shown in Table IX and Table X, even for these designs the fractional power savings at RTL and gate-level are similar, thus supporting the claim that RTL power estimation can be successfully used for analyzing low-power techniques at a level above RTL.
Thus, applying such low-power techniques earlier in the design (above RTL) aids in faster exploration and thus increases the designer's productivity. Moreover, in some cases applying these techniques at a higher level of abstraction helps in taking efficient decisions to further increase the achieved power savings. In other words, the results show that applying low-power techniques above RTL may aid in extra power savings in addition to offering the advantage of earlier assessment of the effects of these optimizations on the power consumption of a design.
VII. SUMMARY AND FUTURE WORK
Algorithm 1 provides an efficient method for the generation of appropriate gated clocks for the registers of a design. Such assignment of gated clocks becomes difficult, and hence may be inefficient, at lower levels of abstraction. Algorithm 2 performs automatic insertion of gating logic for combinational power reduction. It exploits the fact that if an action is not executed in the present clock cycle, then the computation occurring in its body can be avoided for the purposes of power savings. However, application of Algorithm 2 is associated with the following issues:
1) Algorithm 2 may increase the power consumption for designs where most actions are executing frequently.
2) Power savings obtained using Algorithm 2 are associated with a corresponding increase in the area of the design.
These issues can be resolved by refining Algorithm 2. Instead of inserting the gating logic in all the actions of a design, gating logic can be implemented only in those actions of the design that do not execute frequently. Such refinement can be guided by feeding back the execution traces of the actions of a design to BSC. We are planning to implement such a refinement as part of the future work. This will also involve an analysis of a design spec at the CAOS level to select appropriate actions in which gating logic should be inserted.
The presented experimental results demonstrate the effectiveness of using the proposed algorithms for dynamic power reduction of CAOS-based designs. As expected, the proposed algorithms result in power savings for most designs at the cost of corresponding area/latency penalties. The results obtained using these algorithms are comparable to those obtained after similar power optimizations are done using commercial logic synthesis tools. Hence, applying such power optimization techniques at the CAOS level facilitates earlier (above gate-level) assessment of the effects of such techniques on the power, area and latency of a design, thus aiding in faster architectural exploration and enhanced productivity.
