Many techniques for power reduction in advanced RTL synthesis tools rely explicitly or implicitly on observability don't-care conditions. In this article we propose a systematic approach to maximize the effectiveness of these techniques by generating power-friendly RTL descriptions in behavioral synthesis. This is done using operation gating, that is, explicitly adding a predicate to an operation based on its observability condition, so that the operation, once identified as unobservable at runtime, can be avoided using RTL power optimization techniques such as clock gating.
INTRODUCTION
In recent years, power dissipation has become an increasingly critical issue in VLSI design. A number of techniques for power reduction have been developed in advanced RTL synthesis tools. While some techniques try to replace power-hungry devices with their power-efficient counterparts at possible costs in performance and/or area, other orthogonal approaches reduce power by avoiding unnecessary operations, using techniques such as operand isolation, clock gating, and power gating. Observability Don't-Care (ODC) conditions, introduced by the logic synthesis community (for example, De Micheli [1994], Devadas et al. [1994], Gajski et al. [2002]), play an important role in the identification of unnecessary operations in a Boolean network. Isolation cells can be inserted at inputs of a functional unit when its result is not observable [Münch et al. 2000]. For clock gating, the simplest approach is based on stability conditions [Fraer et al. 2008]: when the value stored in a register does not change, its clock can be gated. It is recognized that exploiting ODC conditions in clock gating can lead to significantly more power reduction by avoiding unobservable value changes in registers [Babighian et al. 2005; Benini and De Micheli 1996; Benini et al. 1999; Fraer et al. 2008; Ohnishi et al. 1997]. Figure 1 shows a simple case where ODC conditions in an RTL design are used for clock gating: when A is greater than 5, the output of the multiplexer is equal to A, and thus $B^2$ computed by the multiplier is unobservable; we can then obtain the implementation shown on the right, where the clock is gated for some registers and activity of the multiplier can be avoided.
The problem of computing ODC conditions in a sequential RTL model has been approached in a number of ways. Some prior work views the circuit as a Finite-State Machine (FSM) and calculates the exact ODC condition for every bit using formal methods [Benini and De Micheli 1996; Benini et al. 1999]. However, the number of states in an FSM can be exponential in the number of registers. Thus, the exact approach can be prohibitively expensive for moderately large designs. Methods developed in practical systems are often conservative but more scalable, without a thorough analysis of the FSM. The work in Münch et al. [2000] assumes that every value stored in a register is observable and only performs analysis on combinational parts of the circuit. Similar assumptions are used in many commercial products. The approach in Ohnishi et al. [1997] relies on special patterns in the Hardware Description Language (HDL) code to compute ODC conditions, and thus the quality of results depends on the coding style. The algorithm in Babighian et al. [2005] detects ODC conditions based on datapath topology, using backward traversal and levelization techniques. A more recent work [Fraer et al. 2008] shows that more ODC conditions can be uncovered in the results of Babighian et al. [2005] by propagating ODC conditions that are already utilized in other parts of the design (possibly discovered manually by the designer). Wang and Roy [2003] propose to perform observability analysis at the behavior level, in order to discover opportunities for power optimization in an existing RTL design. The approach relies on the branching structure of the program, but ignores correlation between Boolean values. All these methods have been shown to be very effective in practice. However, it is not clear how much opportunity for power optimization still exists due to obvious pessimism when computing ODC conditions.
Even with a powerful tool that could calculate the exact ODC condition for every signal in an RTL model efficiently, the opportunity for power saving is still limited by the existing RTL design. In the example shown in Figure 2, the comparison is performed later than the multiplication, and the clock gating technique in Figure 1 cannot be applied.
This suggests that huge opportunities remain unexploited at a higher level where there is freedom in choosing a good RTL structure among many alternatives of the same functionality. A more sophisticated example is shown in Figure 3, where different schedules with the same latency imply different opportunities for avoiding operations. Note that there is a select instruction in the behavior code (corresponding to a multiplexer in the circuit), and thus the evaluation of some values is not necessary, depending on which value is selected as the output. In the first schedule, two multiplications (v1 and v2) are always executed. In the second one, when v7 is evaluated as false in the first step, v9 will be equal to v3, and values including v1, v2, v5, and v6 are not needed because they will not influence the output. By scheduling instructions intelligently and imposing guarding conditions (predicates) on operations based on ODC conditions in the resulting RTL design (this is referred to as operation gating in this article), we can effectively restructure the control flow and get different equivalent C codes as shown on the right side; the resulting implementation can have different power under the same performance constraint. A powerful behavioral synthesis tool could explore such higher-level opportunities to generate and redistribute ODC conditions, whereas an RTL synthesis tool is unable to explore this design space; it can at most take advantage of available ODC conditions for a fixed schedule.
It then becomes an interesting problem how a power-friendly RTL model can be generated in order to maximize the effectiveness of ODC-based power management techniques. In this work, we study this problem systematically. Our contributions include the following.
- We present a formal framework for observability analysis at the behavior level. We introduce several observability measures and explain their relations.
- We describe an efficient method to compute observability at the behavior level. We first present an abstraction of a dataflow graph using only black boxes and generalized select operations. Then, a method is developed to compute the smoothed behavior-level observability based on several theorems. The method is exact for dataflow graphs with black box abstraction. We also allow certain forms of knowledge about inputs and other instructions to be considered.
- We present a behavioral synthesis flow for power optimization using operation gating, guided by behavior-level observability, and demonstrate the effectiveness of our approach in real-world designs. To the best of our knowledge, this is the first time that behavioral synthesis is guided by a comprehensive observability analysis for power optimization.
The rest of this article is organized as follows. In Section 2 we describe our assumptions on the behavioral synthesis system and the target hardware architecture. Behavior-level observability, as well as its approximation under the proposed black-box abstraction, is introduced in Section 3. An efficient algorithm for observability analysis at the word level is proposed in Section 4, based on several theorems. We briefly describe our approach to observability-guided power optimization in scheduling in Section 5 and report experimental results in Section 6. Section 7 discusses related work, followed by concluding remarks in Section 8.
BACKGROUND
In this section we describe a behavioral synthesis system which serves as the context for the discussions in the following parts of this article. In our behavioral synthesis system, a compiler front-end parses and optimizes behavioral descriptions in high-level languages (like C/C++) and generates descriptions in an intermediate representation, which can be represented as a Control/Data Flow Graph (CDFG). A CDFG is a graph G = (V, E), where each node v ∈ V represents an operation (also called an instruction), and each directed edge e ∈ E represents a data dependency or a control dependency. Operations are scheduled statically by the synthesis tool. For each operation v, an integer-valued scheduling variable $s_v$ represents the time slot in which v is performed; a Finite-State Machine with Datapath (FSMD) model [Gajski et al. 1992] can be constructed once the scheduling variable for every operation is decided [Cong and Zhang 2006]. The FSMD model can be subsequently translated into an RTL model through a binding process which maps operations to functional units, variables to storage units, and data transfers to interconnects. The overall flow is illustrated in Figure 4.
When we perform operation gating, a predicate is added to an operation based on its observability. Unlike general-purpose processors where the syntactic form of a predicate is usually very limited (e.g., including only one or two dedicated predicate registers), the target architecture of a behavioral synthesis system can be customized, thus allowing much more flexibility in the Instruction-Set Architecture (ISA), the form of a predicate in particular. Here we consider a target architecture where the predicate can be any expression of an arbitrary number of Boolean values (literals). This is based on the fact that evaluation of Boolean expressions on application-specific hardware can often be done in a very short time (typically much shorter than a clock cycle) with a relatively small overhead (a few logic gates and wires). However, we limit the form of a literal to a Boolean value because expressions involving non-Boolean values can be much more expensive to evaluate; if the logic of the predicate involves non-Boolean values, additional operations (such as truncation, comparison) are needed to obtain Boolean values from non-Boolean values.
Observability conditions associated with different levels of abstraction can be different. In the scheduling process of transforming a CDFG into an FSMD, behavior-level observability conditions are translated into FSMD-observability conditions (observability under a given schedule, more precisely defined in Section 5). In the example in Figure 3, a behavior-level observability condition for v1 is v6v7v8, that is, v1 is observable only when v6, v7, and v8 are all true. However, the evaluation of v1 can never be avoided in the first schedule, because the observability of v1 is always unknown when it is evaluated and conservative decisions have to be made to guarantee correctness. The second schedule is better in the sense that we could avoid evaluating v1 when v7 is known to be false, because v6v7v8 will then be false even when v6 and v8 are not yet evaluated. Here we say the FSMD-observability condition of v1 is v7. Clearly, different schedules imply different ODC conditions on their associated FSMDs, and it is not always possible to postpone every instruction until its behavior-level observability condition is known, due partly to performance constraints.
The problem considered in this article can be described as follows: Given a CDFG and profiling information, as well as the cost (average power) for executing each instruction, find a schedule that leads to the smallest expected total cost after operation gating, subject to data-dependency constraints and a latency constraint.
BEHAVIOR-LEVEL OBSERVABILITY

Observability for a General Function
The observability of a function f(x, y) with respect to part of its variables x is a Boolean-valued function of the rest of the variables y; the observability is true for values of y that make it possible for changes in x to be observable at the output.
Informally, $O_x f$ is a necessary condition about y for x to be observable. It is clear that Definition 3.1 is compatible with the definition of the observability condition for Boolean functions.
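The informal description above can be illustrated by brute force on small finite domains. The following is a minimal sketch, not the paper's definition: the helper name `observability`, the 3-bit value ranges, and the reading of Figure 1 as "output A when A > 5, otherwise B*B" are assumptions made for this example.

```python
def observability(f, x_domain, y_domain):
    """Map each y to True when some change of x can alter f(x, y)."""
    result = {}
    for y in y_domain:
        outputs = {f(x, y) for x in x_domain}
        result[y] = len(outputs) > 1  # more than one output value: x is observable
    return result


def fig1(b, a):
    # Assumed reading of Figure 1: the multiplexer outputs A when A > 5,
    # and the multiplier result B*B otherwise.
    return a if a > 5 else b * b


# Observability of B (the x part) as a function of A (the y part).
obs_b = observability(fig1, x_domain=range(8), y_domain=range(8))

# Values of A for which B is unobservable, so the multiplier could be gated.
unobservable_a = sorted(a for a, seen in obs_b.items() if not seen)
print(unobservable_a)
```

Under these assumptions, B is unobservable exactly for the values of A greater than 5, which is the condition used for clock gating in Figure 1.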
When only part of the variables can be used for observability computation, we usually need to make conservative decisions about other unknown variables by using a necessary condition of the exact observability. This can be done using projection.
Informally, $P_{y_1} h(y_1)$ is weaker than $h(y_1, y_2)$, but it is the strongest necessary condition for $h(y_1, y_2)$ with respect to $y_1$.
For a dataflow graph g, let $x \in \mathcal{X}$ be the value whose observability is being considered. We cut the edges from the operation that computes x, and treat x as a primary input. Let $V \in \mathcal{V}$ be the set of all the other primary inputs in g, among which $B \in \mathcal{B}$ is the set of Boolean values, and $C \in \mathcal{C}$ is the set of other values. Let $OUT \in \mathcal{OUT}$ be the output. We view the program as a function $g : \mathcal{X} \times \mathcal{B} \times \mathcal{C} \to \mathcal{OUT}$.
Dataflow Graph Abstraction
Unlike a Boolean network, where each vertex is a simple Boolean operator and each edge is a single-bit signal, a dataflow graph represents program behavior at the word level. Operations in a dataflow graph can be either Boolean operators like those in a Boolean network (such as and, or, not), or more complex ones at the word level (such as add, mul, div). A special operation is select, whose output is equal to one of its data inputs, depending on the control input. While it is theoretically possible to decompose all complex word-level operations into Boolean networks and compute BL O(x) using techniques for observability computation in Boolean networks, the approach is often computationally intractable due to the large size of the network. Moreover, even if we have an efficient way to compute observability in the large Boolean network, the observability condition is likely to be very complex, involving bits from word-level signals. This is not very useful for operation gating, because only Boolean values can be used in predicates, according to our assumption in Section 2.
To get reasonable observability conditions efficiently at the word level without elaborating all details, we propose to model complex operations (i.e., all operations that take a non-Boolean value as an input, excluding select) in the dataflow graph as black boxes. A black box has fixed input/output bitwidths, and it implements some non-Boolean function whose semantics is ignored in our analysis. In other words, a black box can be instantiated as any function with the specified input/output bitwidths. Under this abstraction, no knowledge about the complex operations can be used, and the goal is to obtain observability conditions that hold regardless of the instantiations of black boxes. Consider the general case where a dataflow graph has m black boxes $B_1, B_2, \ldots, B_m$. The observability condition of a value x depends on the instantiation of each black box. Let $\text{BL O}^{f_1, \ldots, f_m}(x)$ be the observability condition of x when $B_i$ is instantiated as a function $f_i$. We are interested in the strongest necessary condition for BL O(x), without knowing anything about $f_1, \ldots, f_m$.
The result is defined as the smoothing of BL O(x) with respect to all the black boxes. Denoting the result of smoothing as S BL O(x), we have
$$\text{S BL O}(x) = \sum_{f_1, \ldots, f_m} \text{BL O}^{f_1, \ldots, f_m}(x) \qquad (1)$$
Recall that the smoothing of a Boolean function $g(x_1, \ldots, x_n)$ with respect to a variable $x_i$ is $S_{x_i} g = g|_{x_i = 0} + g|_{x_i = 1}$ [De Micheli 1994]. Note that our definition of smoothing over black-box functions in Eq. (1) is compatible with the original definition in De Micheli [1994]. In fact, they are the same if we view variable $x_i$ as the output of a 0-input-1-output black-box function. Clearly, S BL O(x) is weaker than the exact BL O(x) due to the absence of knowledge about operations that are modeled as black boxes. However, the result of such an operation typically depends on all of its operands (with rare exceptions, such as the case when one of the inputs of a mul operation is 0), and correlations between different non-Boolean values are difficult to analyze and represent in a thorough way at the word level. Thus, we consider the black-box modeling a reasonable abstraction for behavior-level observability analysis.
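For readers less familiar with smoothing in the Boolean-network sense (existential quantification of one variable), the following small sketch shows it on an ordinary Boolean function. The helper name `smooth` and the example function `g` are hypothetical and chosen only for illustration; smoothing over black-box instantiations as in Eq. (1) is not implemented here.

```python
from itertools import product


def smooth(g, index):
    """Smoothing of g w.r.t. its argument at position `index`:
    OR of the two cofactors, so the result ignores that argument."""
    def smoothed(*args):
        lo = list(args)
        lo[index] = False  # cofactor with the variable set to 0
        hi = list(args)
        hi[index] = True   # cofactor with the variable set to 1
        return g(*lo) or g(*hi)
    return smoothed


# Hypothetical example: g(a, b, c) = a AND (b OR c).
g = lambda a, b, c: a and (b or c)
g_smooth_b = smooth(g, 1)  # smoothing away b leaves the condition a

for a, b, c in product([False, True], repeat=3):
    # The smoothed function is weaker: g implies it for every assignment.
    assert (not g(a, b, c)) or g_smooth_b(a, b, c)
```

The smoothed condition is implied by the original one for every assignment, which mirrors the statement that S BL O(x) is weaker than the exact BL O(x).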
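We also generalize the select operation to facilitate observability analysis. A (k, l)-select operation takes k control inputs $b_1, \ldots, b_k$ and l data inputs, and generates one output z. All control inputs are Boolean variables, and all data inputs are of the same bitwidth. The output z is equal to the i-th data input when the selection function $s_i(b_1, \ldots, b_k)$ is true.
Here $s_1, \ldots, s_l$ is a set of orthonormal Boolean bases; that is, $\sum_{i=1}^{l} s_i = 1$ and $s_i s_j = 0$ for all $i \neq j$.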
It is easy to see that the traditional select operation is a (1, 2)-select with $s_1(b_1) = b_1$ and $s_2(b_1) = \overline{b_1}$. This generalized select allows selecting from multiple data inputs, and absorbs Boolean functions.
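A generalized select can be sketched in a few lines. This is an illustrative model only: the function names `is_orthonormal` and `generalized_select`, the representation of selection functions as Python callables, and the sample (2, 2)-select are assumptions for this example rather than the paper's data structures.

```python
from itertools import product


def is_orthonormal(bases, k):
    """True when exactly one selection function holds for every control assignment."""
    return all(
        sum(bool(s(*bits)) for s in bases) == 1
        for bits in product([False, True], repeat=k)
    )


def generalized_select(bases, controls, data):
    """Return the data input whose selection function is true."""
    if len(bases) != len(data):
        raise ValueError("need one selection function per data input")
    for s, d in zip(bases, data):
        if s(*controls):
            return d
    raise ValueError("selection functions do not cover this control assignment")


# The traditional select as a (1, 2)-select: s1 = b1, s2 = NOT b1.
traditional = [lambda b1: b1, lambda b1: not b1]

# A hypothetical (2, 2)-select that absorbs an AND of its two control inputs.
absorbed_and = [lambda p, q: p and q, lambda p, q: not (p and q)]

print(generalized_select(traditional, (True,), ("x", "y")))
print(generalized_select(absorbed_and, (True, False), ("x", "y")))
```

Checking orthonormality up front guarantees that exactly one data input is selected for every control assignment, so the loop in `generalized_select` is well defined.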
With the approach of black-box modeling and the preceding generalization of select, there are only two types of operations remaining in the dataflow graph: black box and generalized select. We make the following restrictions to facilitate discussion in the following parts of the article. Note that a valid dataflow graph can always be normalized to a form that conforms to these restrictions.
(1) If a Boolean variable is used by a select operation as a control input, it cannot be used by black boxes or by any select operation as a data input. In other words, values are divided into two categories: either used purely as control signals (by select operations), or purely as data inputs (by black boxes and select operations).
    - If a Boolean variable b is a primary output, introduce a (1, 2)-select that selects constant 1 if b is true and constant 0 if b is false.
    - If a select operation takes Boolean variables as data inputs, we can always replace the select operation with a Boolean expression (which is eventually absorbed in another select).
(2) For each select operation, all of its inputs are distinct.
    - If a Boolean variable appears more than once in the list of control inputs of a select operation, the number of control inputs can be reduced, and the selecting logic can be simplified.
    - If the same data value is selected in more than one case, the number of data inputs can be reduced, and the cases where the same value is selected can be merged.
(3) Each data input of a select comes from a black box.
    - Primary inputs and constants are regarded as outputs of black boxes.
    - When the result of a select $n_1$ is used by another select $n_2$, we can simply replace $n_2$ with a combination of $n_1$ and $n_2$, so that the result of $n_1$ is not used by another select. Note that $n_1$ may still be present after the transformation if it has other uses.
For the example dataflow graph shown in Figure 3, operations that evaluate v1, v2, v3, v4, v5, v6, and v7 are all modeled as black boxes; the operation that evaluates v8 is absorbed in the (2, 2)-select operation v9; the selection functions for the (2, 2)-select are $s_{v3}(v6, v7) = v6\,v7$ and $s_{v4}(v6, v7) = \overline{v6\,v7}$. The dataflow graph is transformed as shown in Figure 5.
BEHAVIOR-LEVEL OBSERVABILITY ANALYSIS
As mentioned previously, computing the exact observability condition requires nontrivial effort, essentially breaking all values into individual bits and applying techniques for Boolean networks. In this section, we describe an algorithm to compute S BL O, that is, observability under the abstraction using black boxes described in Section 3. The algorithm propagates and manipulates S BL O directly, and thus it avoids the trouble of considering the instantiations of black boxes.
Review of Observability Computation in a Boolean Network
Our method for computing S BL O is based on a technique for observability analysis in Boolean networks [De Micheli 1994]. Here we give a brief review of the algorithm. For simplicity, we consider the case when the Boolean network has only one primary output. The observability of a node in the Boolean network with regard to multiple primary outputs can be computed by summing up its observability conditions with regard to each individual primary output. The algorithm labels the observability conditions on nodes and edges in reverse topological order. It proceeds with three kinds of actions.
Initialize. For the primary output z under consideration, set BL O(z) = 1. For every other primary output w, set BL O(w) = 0.
Propagate node observability to its input edges.
Merge edge observability to get node observability. If a value y is used m times as $y_1, \ldots, y_m$, then when y is visited we already have the edge observability conditions BL O($y_i$), i = 1, ..., m. These edge observability conditions are computed independently from downstream operations; thus the correlation $y_1 = y_2 = \cdots = y_m$ needs to be considered when merging edge observability conditions. Eq. (6) is used to derive node observability.
$$\text{BL O}(y) = \bigoplus_{i=1}^{m} \text{BL O}(y_i)\Big|_{y_{i+1} = \cdots = y_m = \overline{y}} \qquad (6)$$
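The need to account for the correlation between fanout copies in the merge step can be checked by brute force. The sketch below assumes the standard Boolean-network form of the merge rule (exclusive-or of edge observabilities, with the copies after the i-th one set to the complement of y); the function names and the tiny example are hypothetical.

```python
def edge_observability(f, i, copies, others):
    """Boolean difference of f with respect to the i-th fanout copy of y."""
    lo = list(copies)
    lo[i] = False
    hi = list(copies)
    hi[i] = True
    return f(tuple(lo), others) != f(tuple(hi), others)


def merged_observability(f, m, y, others):
    """XOR of edge observabilities; copies before i equal y, copies after i equal NOT y."""
    result = False
    for i in range(m):
        copies = [y] * (i + 1) + [not y] * (m - i - 1)
        result ^= edge_observability(f, i, copies, others)
    return result


def exact_observability(f, m, others):
    """Reference value: does flipping all copies of y together change the output?"""
    return f((False,) * m, others) != f((True,) * m, others)


# Reconvergent fanout: the output is y1 XOR y2, where both copies carry y.
f_xor = lambda copies, others: copies[0] != copies[1]

# Each edge looks observable in isolation, so a plain OR would be pessimistic.
naive = any(edge_observability(f_xor, i, (False, False), ()) for i in range(2))
merged = merged_observability(f_xor, 2, False, ())
print(naive, merged, exact_observability(f_xor, 2, ()))
```

In this example each fanout edge is observable on its own, yet y is never observable at the output because the two copies always cancel; the merged value agrees with the exact one.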
Observability Analysis with Black Box
In this subsection, we show that for a Boolean network composed of black boxes and generalized select operations, if the requirements in Section 3.2 are satisfied, we can compute and propagate the smoothed observability S BL O directly, using an approach similar to that for Boolean networks. For simplicity, we first consider the case with only one output (an output is an operation whose result may be used outside the current dataflow graph, or by the next loop iteration); the observability with regard to multiple outputs can be obtained by considering these outputs one by one and summing up the smoothed observability conditions with regard to different outputs for the same value. We still have the three types of actions, namely initialization, propagation, and merging, among which the initialization is trivial. In the following, we develop theorems that give rules for propagation through black boxes and generalized select operations, as well as rules for merging of control signals and data signals. Proofs of these theorems can be found in the Appendix.
Based on the preceding theorems, we develop a process called observability analysis to compute the smoothed observability conditions for all operations. The input program is thoroughly optimized using classic compiler optimizations including control-flow optimization and if-conversion (using the select instructions); a dataflow graph is formed for each acyclic region, and preprocessed into a graph with black boxes and generalized select operations. The algorithm for observability analysis is shown in Algorithm 1. For the example dataflow graph with black boxes and the generalized select operations shown in Figure 5, after applying Algorithm 1, we get the smoothed behavior-level observability condition shown in Table I.
THEOREM 4.2 (PROPAGATE THROUGH SELECT). For a
Since the expression of S BL O contains only Boolean control signals, S BL O is also a BBL O (and is also a BL O), according to definition. However, it is not necessarily the strongest BBL O possible, due to the introduction of black boxes. The output of a black box is completely unknown, and correlations between values are completely lost after a black box. While we consider this black-box model very useful to enable analysis at a higher level, certain knowledge about values in the dataflow graph, once uncovered, can be employed to strengthen the condition. To do that, we are mostly interested in correlations between Boolean values, for example, (x == 3) implies (x < 10). Although capturing exact relations between Boolean values is nontrivial, at least some knowledge can be discovered and exploited. Such techniques have been developed in compilers [August et al. 1999] , and can be directly applied in our algorithm.
For the example in Figure 3 , let us assume that we know input c is always an odd number. By analyzing the observability-propagating instructions, it can be asserted that the two Boolean values, v6 = (a * a + b * b == 100) and 
propagation from black box (v1) to its input
propagation from black box (v2) to its input
edge observability for b c S BL O(c v3 ) + S BL O(c v7 ) = v6 merge edge observability for c
Here c v3 denotes the edge observability for the edge from v3 to c. If a node has only one outgoing edge, the edge observability and node observability are the same. v7 = (a == c) cannot be true simultaneously, because the set of integer values of a that satisfies a * a + b * b == 100 is {0, 6, 8, 10}, all elements of which are even, so v7 = (a == c) will be false if v6 is true. Thus we have the knowledge v6v7 = true, which can be used to simplify the conditions. For example, we have BL O(v3) = v6v7; with that knowledge, we have BL O(v3) = v6v7∧v6v7 = f alse, that is, we find that v3 is always unobservable when c is odd.
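The claim that v6 and v7 cannot both hold when c is odd is easy to confirm by exhaustive search over a small range. This is only a sanity check with assumed names (`counterexamples`, `values_of_a`) and an assumed search bound of 12; a compiler would establish such facts by predicate analysis rather than enumeration.

```python
def counterexamples(limit=12):
    """List (a, b, c) with odd c for which v6 and v7 are both true."""
    found = []
    values = range(-limit, limit + 1)
    for a in values:
        for b in values:
            v6 = (a * a + b * b == 100)
            for c in values:
                if c % 2 == 0:
                    continue  # assumed knowledge: c is always odd
                v7 = (a == c)
                if v6 and v7:
                    found.append((a, b, c))
    return found


def values_of_a(limit=12):
    """Integer values of a for which some b satisfies a*a + b*b == 100."""
    values = range(-limit, limit + 1)
    return sorted({a for a in values for b in values if a * a + b * b == 100})


print(values_of_a())
print(counterexamples())
```

Since every admissible value of a is even, the search finds no assignment with an odd c that makes both v6 and v7 true.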
SCHEDULING FOR OPERATION GATING OPTIMIZATION
In this section we discuss how the behavior-level observability obtained by Algorithm 1 can be used to improve the effectiveness of operation gating in scheduling.
Observability Under a Given Schedule
For a given schedule s of program g, let $A_s(x)$ be the set of values available when value x is evaluated (i.e., the set of values generated by operations scheduled to finish before the evaluation of x starts). We define FSMD-observability as follows.
Fig. 6. Relations between observabilities. The bold arc shows the way we obtain BFSMD O, which is used as the predicate.
Definition 5.1 (FSMD-Observability). An FSMD-observability condition of g with respect to x under a given schedule s is $\text{FSMD O}_s(x) \equiv P_{A_s(x)}\,\text{BL O}(x)$.
Definition 5.2 (B-FSMD-Observability). A Boolean FSMD-observability condition (B-FSMD-observability) of g with respect to x under a given schedule s is $\text{BFSMD O}_s(x) \equiv P_{B}\,\text{FSMD O}_s(x)$.
$\text{BFSMD O}_s(x)$ is the condition we can use as the predicate of the operation that computes x. The conceptual difference between BL O, BBL O, FSMD O, and BFSMD O lies in the set of values that are used to evaluate observability. All values in the program can be used for behavior-level analysis, while only available values are meaningful when the schedule is fixed. Theoretically, both Boolean and non-Boolean values can be used, while in practice most architectures support only a Boolean expression as the predicate of an instruction.
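Projection onto the available values can be sketched as existential quantification of every value that is not yet available. The code below is an illustration under stated assumptions: the helper `project`, the dictionary-based encoding of conditions, and the use of the condition v6v7v8 for v1 (taken from the Figure 3 discussion in Section 2) are choices made for this example.

```python
from itertools import product


def project(cond, variables, available):
    """Strongest necessary condition of `cond` over the available variables,
    obtained by existentially quantifying all other variables."""
    hidden = [v for v in variables if v not in available]

    def projected(known):
        for bits in product([False, True], repeat=len(hidden)):
            env = dict(known)
            env.update(zip(hidden, bits))
            if cond(env):
                return True  # some completion makes the value observable
        return False

    return projected


# Behavior-level condition for v1 quoted in the text: v6 AND v7 AND v8.
cond_v1 = lambda env: env["v6"] and env["v7"] and env["v8"]
variables = ["v6", "v7", "v8"]

# Second schedule: only v7 is available when v1 is evaluated.
pred_second = project(cond_v1, variables, available={"v7"})
# First schedule: nothing is available yet.
pred_first = project(cond_v1, variables, available=set())

print(pred_second({"v7": False}), pred_second({"v7": True}), pred_first({}))
```

With only v7 available, the projected predicate reduces to v7, so v1 can be skipped when v7 is false; with nothing available, the predicate is constantly true and v1 can never be avoided.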
Using Lemma 3.1, we have the next theorem.
Theorem 5.1 uncovers relations between BL O, BBL O, FSMD O, and BFSMD O; it gives a way to compute BFSMD O under a given schedule by projecting a BBL O condition onto available values. In Figure 6, the arcs illustrate the definition of observabilities by projection, and the bold arc illustrates the way to compute BFSMD O from BBL O by projection, as stated in Theorem 5.1. Using Algorithm 1, we obtain S BL O as a BBL O, which is subsequently projected as an FSMD O for operation gating.
Previous Work on Operation Gating
Considering the fact that only Boolean values already evaluated can be used in predicates for operation gating, the impact of scheduling on ODC-based power management is obvious. To our knowledge, the work in Monteiro et al. [1996] presents the first algorithm designed to create more opportunities for ODC-based power management. This method works as a postprocessing step on an existing schedule: it examines multiplexers (select instructions) one by one and tries to move the instruction by computing the Boolean operand earlier if possible. The authors of Monteiro et al. [1996] noticed that their results depended on the order in which multiplexers were examined, and used reverse topological order in their implementation. Chen and Sarrafzadeh [2002] propose an improved optimization technique using priority and soft dependency edges.
Both Monteiro et al. [1996] and Chen and Sarrafzadeh [2002] use a very simple method for observability analysis. They do not generalize the select operation; thus the dataflow graph contains Boolean operations such as and/or/not, which generate the control signals for select. However, those Boolean operations are essentially modeled as black boxes just like add/sub. Hence, knowledge about Boolean operations is unexploited, resulting in weaker observability conditions compared to the proposed method in Section 4. For example, in the schedule illustrated in Figure 3 , the evaluation of v3 can be avoided when v7 is false. This may look straightforward in the original code, but it is nontrivial in the if-converted form, where the first operand of the select instruction, v8, is computed later than v3. The method in Monteiro et al. [1996] and Chen and Sarrafzadeh [2002] will not find such an opportunity. Although the method can possibly be extended by viewing and/or as degenerated select, it still cannot capture the information that either operand can mask the observability of the other. Such information is essential for control-flow restructuring when the scheduler exploits different possible speculation/predication schemes under a latency constraint.
Scheduling Optimization for Operation Gating
When the schedule is optimized for operation gating, along with other objectives such as latency, different algorithm frameworks can be used. As the problem is intrinsically difficult even without the consideration of operation gating, it is often solved using heuristics like list scheduling [Landskov et al. 1980] or force-directed scheduling [Paulin and Knight 1989]. The postprocessing technique in Monteiro et al. [1996] and the approach of Chen and Sarrafzadeh [2002] can be viewed as natural adaptations of previous heuristics to the problem with consideration of operation gating. In our implementation, we extend a previous formulation of scheduling based on mathematical programming [Cong and Zhang 2006]. The formulation is still a heuristic with approximations instead of an exact method; yet it is able to optimize the schedule of all operations globally. For each operation v, an integer-valued scheduling variable $s_v$ is introduced to represent the time slot in which operation v is performed. Once the scheduling variable for every operation is decided, an FSMD model can be constructed [Cong and Zhang 2006]. The task of scheduling is thus to decide $s_v$ for every operation v.
The formulation uses two types of constraints, namely integer-difference hard constraints and integer-difference soft constraints. Both constraints have the same form, and they differ in the sense that integer-difference soft constraints are not necessarily satisfied. Integer-difference hard constraints can be used to model a wide range of traditional scheduling constraints, including dependency, latency, frequency, resource, etc., as shown in Cong and Zhang [2006] .
Integer-difference soft constraints, on the other hand, can be used to express the intention of operation gating. When it is preferred that a Boolean value c is scheduled before another value v so that v can be avoided when c takes a certain value, an integer-difference soft constraint can be added as
where $d_c$ is the number of clock cycles operation c spans, and b is the number of clock cycles needed to separate operations c and v. The value of b depends on the power management technique and the target platform: a typical value of b is 1 if clock gating or operand isolation is used, meaning that the condition should be available at least 1 cycle before it can be used as a predicate for clock gating. For power gating, b is likely larger than 1. Then the problem of power optimization using operation gating can be described in a mathematical-programming form as follows.
Here $p_i$ and $q_j$ are constants in various constraints for dependency, frequency, latency, resource, etc. Their values are determined by various constraint generators as described in Cong and Zhang [2006]. The $c_k$ are constants that serve as weights for different objectives.
To handle soft constraints in the formulation described in Eq. (8), we introduce a violation variable $w_j$ for each soft constraint j to represent the amount of violation.
A penalty term $\phi_j(w_j)$ is added to the objective function and the formulation can be written in matrix form as
It is shown in Cong et al. [2009b] that the preceding formulation can be solved optimally in polynomial time with integer solutions, if each penalty function $\phi_j(w_j)$ is convex. In such a case, the problem is reduced to a linear program with a totally unimodular constraint matrix [Cong et al. 2009b]; thus it is guaranteed to have integral solutions. For nonconvex penalty functions (in this case the binary penalty, where violating a soft constraint leads to a constant cost), iterative approximation techniques are also developed that approximate a binary penalty with a sequence of linear penalty functions. More details of the solver are omitted here as they are not the focus of this article.
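The interplay of hard constraints, soft constraints, and a binary penalty can be illustrated on a toy instance. The sketch below deliberately replaces the linear-programming solver with exhaustive enumeration, which is only feasible for a handful of operations. The function `schedule`, the operation names, the penalty weight, and the assumed difference form `s[u] - s[v] <= d` for both kinds of constraints are illustrative assumptions; in particular, the soft constraint is assumed to read `s[cmp] + d_c + b <= s[mul]` with `d_c = 1` and `b = 1`.

```python
from itertools import product


def schedule(ops, hard, soft, last_slot):
    """Enumerate start slots 0..last_slot and return (cost, slots) with the
    smallest total weight of violated soft constraints.

    hard: tuples (u, v, d) meaning s[u] - s[v] <= d must hold.
    soft: tuples (u, v, d, weight); a violation costs `weight` (binary penalty).
    """
    best = None
    for slots in product(range(last_slot + 1), repeat=len(ops)):
        s = dict(zip(ops, slots))
        if any(s[u] - s[v] > d for u, v, d in hard):
            continue  # hard constraints can never be violated
        cost = sum(w for u, v, d, w in soft if s[u] - s[v] > d)
        if best is None or cost < best[0]:
            best = (cost, s)
    return best


ops = ["cmp", "mul", "sel"]
# Data dependencies: cmp and mul each take one cycle and feed sel.
hard = [("cmp", "sel", -1), ("mul", "sel", -1)]
# Soft constraint for gating mul by cmp (assumed form, d_c = 1, b = 1).
soft = [("cmp", "mul", -2, 5.0)]

print(schedule(ops, hard, soft, last_slot=2))
print(schedule(ops, hard, soft, last_slot=3))
```

With the tighter latency bound the soft constraint cannot be met and the penalty is paid; relaxing the bound by one slot lets the comparison finish early enough for the multiplication to be gated.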
EXPERIMENTAL RESULTS
Experiment Setup
Techniques proposed in this work have been implemented in the scheduler of AutoPilot™, a commercial behavioral synthesis tool from AutoESL Design Technologies, Inc. [Zhang et al. 2008]. The tool accepts C/C++/SystemC as the input language and generates RTL specifications in VHDL or Verilog. Our scheduler introduces soft constraints and formulates the problem using techniques described in Section 5. We make comparisons to three other approaches:
(1) a baseline scheduler using the SDC formulation without operation gating,
(2) the iterative algorithm described in Chen and Sarrafzadeh [2002] for operation gating,
(3) an Integer-Linear Programming (ILP) formulation to handle binary penalty exactly for optimal operation gating.
We do not compare our approach with the original work on operation gating in Monteiro et al. [1996], because Chen and Sarrafzadeh [2002] is algorithmically similar to Monteiro et al. [1996], but with an improved strategy. All these approaches are implemented in C++, and the programs run on a workstation with four 2.4 GHz 64-bit CPUs and 8 GB of main memory.
The ILP formulation (for the purpose of optimality study) is briefly described as follows. In addition to all variables and constraints in Eq. (10), a variable $m_j$ is introduced for each violation variable $w_j$. We add constraints
where N is a large constant number so that the constraint in Eq. (11) can always be satisfied when $m_j = 1$. Then $m_j$ is introduced to the objective with a coefficient reflecting the cost of violating soft constraint j. We also explicitly enforce the constraint that every variable is an integer. It is easy to verify that in the solution of the ILP formulation, we have
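One form consistent with this description is the standard big-M indicator encoding, sketched below. It is offered as an assumption, since other equivalent encodings are possible; $\gamma_j$ is a hypothetical name for the cost of violating soft constraint j, and $w_j \ge 0$ is assumed.

```latex
\begin{align*}
  & w_j \le N \cdot m_j, \qquad m_j \in \{0, 1\}, \\
  & \text{objective term: } \gamma_j \, m_j \quad (\gamma_j > 0), \\
  & \text{so at optimality: } m_j =
    \begin{cases}
      0 & \text{if } w_j = 0, \\
      1 & \text{if } w_j > 0.
    \end{cases}
\end{align*}
```

Under this reading, any positive violation forces $m_j = 1$, while minimization drives $m_j$ to 0 whenever $w_j = 0$, so the objective charges a constant cost per violated soft constraint.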
After scheduling, a binding algorithm described in is performed. The RTL code generated by the behavioral synthesis tool is fed to the Magma Talus RTL-to-GDSII toolset. Gate-level simulation under typical input vectors is performed using the Aldec Riviera simulator to obtain power dissipation. All designs are implemented using a TSMC 90nm standard cell library. In this experiment the actual operation gating is carried out by the clock gating on the output registers of the gated operations. Further power savings can potentially be achieved if we apply additional low-power techniques (e.g., operand isolation, feeding sleep vectors for leakage reduction).
Several designs in arithmetic and multimedia applications are used in our experiments. Characteristics of these designs are given in Table II. #node means the number of nodes in the CDFG; it is roughly equal to the number of operations in the program.
Results and Analysis
Results of the four approaches are reported in Table III. Here, area and power after gate-level implementation are reported for each approach. Since the Magma Talus synthesis tool meets the clock cycle time constraint for all cases, we do not report the frequency for each individual approach. We also normalize the power values to those generated by the approach with soft constraints. For some larger designs, the exact ILP formulation (solved by Clp [Forrest et al. 2004], a state-of-the-art open-source ILP solver) fails to find a solution within 7200 seconds. The three other methods all finish within 60 seconds for all cases.
From the results, it is clear that operation gating is a useful technique to create opportunities for power management at the RT level without significant overhead in area. Compared to the SDC scheduling algorithm without considering operation gating, all three of the other methods that optimize for operation gating improve the power dissipation: on average, the method in Chen and Sarrafzadeh [2002] reduces power by 20.1%, the exact method given by ILP reduces power by 34.6%, and our proposed method by 33.9%. The reduction tends to be particularly significant when the design has a complex control structure, like addr. Relatively large memory blocks are present in some of the designs, including BoxMuller, MotionComp, and MotionEst. For such designs, when the access pattern to the memory is fixed, operation gating tends to be less effective, because the memory power is roughly a constant. But if we exclude the power consumed by memory (approximately 1 mW in BoxMuller, more than 2 mW for MotionComp, and 3 mW for MotionEst) and only look at the power consumed purely by logic (functional units, registers, interconnects), the power saving is still very significant. While power consumed in memory blocks can be a very important part of total power, it is usually not controlled by the operation scheduler when memory operations are unavoidable and the access pattern is fixed. Possible techniques that help to reduce memory power include behavioral transformations (loop transformation to enhance memory locality, to leverage burst-mode memory access, etc.) and memory architecture selection, but those are beyond the scope of this study. For a fair comparison, we include the memory power for every design in Table III.
Compared to Chen and Sarrafzadeh [2002], the proposed approach further reduces total power dissipation by an average of 17.1%. This saving is because we are able to consider all opportunities for operation gating simultaneously, and optimize globally in our approach. The approximation of binary penalty function turns out to work very well; the results generated using our approach are very close to those of the exact formulation, and the observed optimality gap in terms of power is about 1%. At the same time, our method is much more scalable than the exact formulation.

[Table note: Cycle is in ns, area is in μm², and power is in mW. The $N_p$ column is the normalized power.]
RELATED WORK ON OBSERVABILITY ANALYSIS
One might think that after partial dead code elimination in the predicated form [Bodík and Gupta 1997; Ryoo et al. 2006], the predicate of every instruction is equal to its behavior-level observability condition because no redundant instruction will be executed along any control path. However, this is not true. Consider two Boolean values used in a Boolean AND instruction: the behavior-level observability condition of the instruction producing either value could contain a term about the other. If behavior-level observability conditions are applied as predicates, there will be a cyclic dependency between the two instructions, and the code becomes illegal. Thus, from the perspective of behavior-level observability, one can always find instructions that are unnecessarily executed (unobservable), even after thorough compiler optimization. Since the execution of unnecessary instructions cannot be avoided completely, profile-guided optimization is needed to minimize the cost of unnecessary execution, as shown in this work. Notably, using behavior-level observability to guide scheduling gives us opportunities to unify both speculative scheduling and control-flow restructuring, as shown in Figure 3. Previous efforts using predicates in hyperblock scheduling also allow speculative scheduling through predicate promotion [Mahlke et al. 1992], and have the ability to simplify program decision logic using knowledge about relations between predicates [August et al. 1999], but a postprocessing pass like predicated partial dead code elimination is needed to strengthen some predicates after scheduling to fully realize the equivalent transformation. Behavior-level observability could provide more information to the scheduler than predicates, and can be helpful when various trade-offs are performed by the scheduler under tight constraints like latency/throughput.
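The cyclic dependency for the Boolean AND mentioned above can be seen by enumerating when each input can affect the output. The sketch below is only an illustration of that argument, not the analysis used in this work; the helper name `observable` is hypothetical.

```python
from itertools import product


def and_gate(x: bool, y: bool) -> bool:
    """The Boolean AND instruction z = x AND y."""
    return x and y


def observable(f, index: int, inputs: tuple) -> bool:
    """True if flipping inputs[index] changes the output of f."""
    flipped = list(inputs)
    flipped[index] = not flipped[index]
    return f(*inputs) != f(*flipped)


# For every assignment (x, y), record whether x and y are each observable at z.
table = {
    (x, y): (observable(and_gate, 0, (x, y)), observable(and_gate, 1, (x, y)))
    for x, y in product([False, True], repeat=2)
}
```

In this table, x is observable exactly when y is true, and y is observable exactly when x is true. Using both conditions as predicates at once would make each instruction wait for the result of the other.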
The work in Wang and Roy [2003] introduced the concept of behavior-level observability and used behavior-level ODC to strengthen conditions for operand isolation and clock gating. However, the algorithm did not consider correlations (between Boolean values, and between data values among a network of select operations), and thus did not capture opportunities for control-flow restructuring. In addition, the work was not intended to guide architectural exploration in behavioral synthesis.
A preliminary version of this work was presented in Cong et al. [2009a]. However, the work presented there did not include a rigorous theoretical justification. The method used a black-box model for complex operations similar to the one in this article, but did not consider the generalized select operation. Thus, when reconvergent paths involving only select operations occur, some data correlations may be lost, leading to weaker observability don't-care conditions. Let us again consider the example code in Figure 3. In practice, an experienced designer may optimize the design to the one in Figure 7, where a redundant Boolean value is introduced in the hope that it can be used in observability computation for further avoiding multiplications. The Boolean value ((a ^ b) & 0xFFFFFFFB) == 0xA is a necessary condition for a * a + b * b == 100. A technique to add such Boolean guards has been developed in Ghodrat et al. [2007] for embedded compilers. We believe that when this technique is applied, our proposed approach on operation gating can be more effective in power reduction.
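The guard can be checked by enumeration. The following is a minimal sketch that reads the operator between a and b as bitwise XOR and assumes small non-negative integers without overflow; both are assumptions of this illustration, and the function names are hypothetical.

```python
def guard(a: int, b: int) -> bool:
    """Cheap Boolean guard, reading the operator between a and b as bitwise XOR."""
    return ((a ^ b) & 0xFFFFFFFB) == 0xA


def target(a: int, b: int) -> bool:
    """The expensive condition, which needs two multiplications."""
    return a * a + b * b == 100


def guard_is_necessary(limit: int = 64) -> bool:
    """Check that target(a, b) implies guard(a, b) for 0 <= a, b < limit."""
    return all(guard(a, b)
               for a in range(limit)
               for b in range(limit)
               if target(a, b))


# All non-negative solutions of a*a + b*b == 100 in the checked range.
solutions = [(a, b) for a in range(64) for b in range(64) if target(a, b)]
```

The guard holds for every solution in this range but is not sufficient on its own: for example, a = 2 and b = 8 pass the guard although 4 + 64 is not 100. It can therefore only be used to skip the multiplications when it is false.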
CONCLUSION
We have developed the first systematic way to analyze observability at the behavior level, and we show how behavior-level observability can guide the scheduler in a behavioral synthesis tool to maximize the chance of operation gating, enabling more RTL power management techniques. Our approach introduces the generalized select operation to capture the observability masking effect and the correlation between conditions, while modeling arithmetic operations as black boxes to enable efficient word-level analysis. Experimental results show that our approach is very effective for power reduction. The analysis using our theory also reveals possible opportunities in compiler optimization using observability don't-cares. We leave this for future work.
APPENDIX
Here we provide proofs of the theorems in Section 4.
PROOF. Without loss of generality, we only consider $x_1$. According to the propagation rule in a Boolean network, we have
Let $S_f$ be the smoothing operator over all possible instantiations of black box f, and let S be the smoothing operator over all black boxes in the network. We have
Note that from Eq. (16) to Eq. (17), we use the fact that there exists an instantiation of f so that
We introduce the following lemma to prove Theorem 4.2.
LEMMA A.2. Let be an arbitrary binary Boolean operator. For a set of orthonormal Boolean basis s
PROOF. According to the definition of orthonormal Boolean basis, exactly one of $\{s_1, \ldots, s_l\}$ is true in any scenario. When $s_i$ is true, both sides of Eq. (19) equal
PROOF. According to the rule for propagation in a Boolean network without black boxes, and the definition of observability,
Here from Eq. (20) 
In the following we show that
(1 PROOF. The lemma follows directly by noting that, after the addition of λ, the set of functions that the combination of the two black boxes can be instantiated as is the same as that of the original black box.
LEMMA A.4.
PROOF. We show the following two cases based on the value of the righthand side.
(1) If 
Theorem 4.1 is used from Eq. (39) to Eq. (40).
(2) When x is used by select instructions as data input, and possibly by black boxes, the idea is to transform to the first case. For every select instruction that uses x, since the result of the select is either used by black boxes or is a primary output, we can add a black box after the output of the select. We further move the 1-input-1-output black box to every data input of the select instruction, using the following rule.
Then, the case is reduced to the first case.
