Abstract-This paper introduces reconditioning: a novel systematic technique for reducing unnecessary power consumption of asynchronous gate-level netlists, which involves the optimal reordering of conditional communication and logic primitives. Our technique is applicable to asynchronous circuits with handshaking protocols that encode data and control together, in particular, quasi delay insensitive and 1-of-N handshaking circuits. Both an optimal integer linear program (ILP) and a fast heuristic algorithm are presented. We show that our ILP is feasible for moderate size circuits and our heuristic algorithm scales to much larger circuits, completing in seconds on circuits with tens of thousands of gates. Our experimental results show power improvement highly depends on the structure of the circuit but can often be above 26% with typically less than 5% area overhead.
data and control signals for completion detection. The widely used 1-of-N encoding [4] and in particular quasi delay insensitive (QDI) circuits [6] are examples of such cases. In these methodologies the data rails in each channel always switch for communicating a new data token even when the new token's value is the same as the previous one. For example, in 1-of-2 four-phase handshaking in QDI circuits, each channel has three wires: $d_0$, $d_1$, and ack. Upon communicating each token, two of these wires perform two transitions each, yielding four transitions per token. In the absence of conditional communication, this imposes significant switching activity and associated power consumption.
Despite the significance of conditional communication, computer-aided design (CAD) tool support for automatically adding/optimizing conditionality in asynchronous circuits is limited. Currently, architectural decomposition of the high-level specification and finding the optimal location of conditional communication is mostly done manually. In some automatic high-level synthesis techniques such as [7], the location of the conditional communication with respect to logic gates is dictated by the high-level description, which may not be optimal. Similarly, operand isolation [8] has been adopted for asynchronous circuits by converting automatically inserted isolation cells into conditional communication, but this approach may also lead to nonoptimal netlists. Others translate clock-gating structures derived from a register transfer language description to conditional asynchronous split-merge architectures [9], [10]. Lastly, controller optimization in synchronous latency-insensitive designs has also been explored to save power [11].
This paper is an extension of our previous publication [12] , which introduced reconditioning as a new power optimization technique for asynchronous circuits which involves moving the existing conditional communication primitives across logic gates to reduce power consumption. This paper first describes a theoretical framework used to prove that reconditioning preserves a notion of logical equivalence using three-valued logic [13] . Second, it shows that if the initial location of the conditional primitives with respect to the logic gates in a gate-level netlist is not optimal, one can move conditional primitives to minimize power. In particular, we formulate the reconditioning problem as an integer linear program (ILP) by extending the classic retiming problem in synchronous circuits [14] . Our experimental results show that our ILP can be solved in a reasonable time for medium size circuits of around 40 000 gates. We also develop a fast and efficient heuristic algorithm. We show that our heuristic approach can achieve close-to-optimal results with much lower computational complexity (e.g., completing in less than 12 s for a circuit with over 70 000 gates). On several pipelined arithmetic circuits, our algorithm achieved a 23% average power improvement with less than 5% area overhead. We also show a case where the conditional primitives are initially placed at optimal positions, where our algorithm declares no further improvements and an encryption core case study where our algorithms achieve up to an 8.4% reduction in power consumption. Although our objective function does not explicitly model performance, in all our examples, latency and throughput were minimally affected.
II. CONTEXT AND MOTIVATION
Synchronous circuits consist of flip-flops and/or latches for storage and logic gates. Asynchronous circuits have analogous primitives that communicate data via tokens sent on uni-directional asynchronous channels [4] . In particular, in asynchronous circuits, the analog of a flip-flop is a token buffer, which upon reset sends a token to its output and subsequently acts as a pipeline buffer. The analog of a logic gate is an asynchronous primitive which unconditionally receives tokens on all inputs, computes its logic function, and sends a token with the resulting value on its output.
The token-based nature of asynchronous circuits demands a third class of primitive, associated with implementing conditional communication. For example, the commercially successful asynchronous application-specific integrated circuit flow, Proteus [15], implements conditional communication by employing the conditional primitives called RECEIVE and SEND. These primitives are described in SystemVerilogCSP [16] as shown in Fig. 1. They each have a data input channel L, a single-bit enable input channel E, and a data output channel R. Both of these primitives unconditionally receive a 1-bit token from E. A SEND primitive unconditionally receives a token on L, but only sends a token on output R when the value of the enable token is 1. A RECEIVE primitive receives a token on L only when the value of the enable token is 1, but unconditionally sends a token on R. The value of the token on R is the same as the input token on L when the enable value is 1; otherwise it is set to 0 (a dummy value).
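The behavior of the two primitives described above can be sketched at the token level as follows. This is only an illustrative Python model, not the SystemVerilogCSP description of Fig. 1; the function names `send` and `receive` and the list-based channels are assumptions made for the sketch.

```python
from collections import deque

def send(L, E):
    """SEND: always consumes one token from L per enable token,
    and forwards it to R only when the enable value is 1."""
    L, E, R = deque(L), deque(E), []
    while E:
        e = E.popleft()
        d = L.popleft()          # unconditional receive on L
        if e == 1:
            R.append(d)          # conditional send on R
    return R

def receive(L, E, dummy=0):
    """RECEIVE: consumes a token from L only when the enable value is 1,
    and always produces a token on R (a dummy value when the enable is 0)."""
    L, E, R = deque(L), deque(E), []
    while E:
        e = E.popleft()
        if e == 1:
            R.append(L.popleft())  # conditional receive on L
        else:
            R.append(dummy)        # dummy token keeps downstream logic from deadlocking
    return R

print(send([1, 0, 1], [1, 0, 1]))
print(receive([1, 1], [1, 0, 1]))
```

In this model a SEND consumes as many data tokens as enable tokens, whereas a RECEIVE produces as many output tokens as enable tokens, which is the asymmetry the rest of the paper exploits.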
Explicitly surrounding a cloud of logic gates with RECEIVE cells at inputs and SEND cells at outputs is a simple and effective method to marry synthesized unconditional logic with conditional communication using commercial synchronous CAD tools [15]. However, the power consumption of this simplistic arrangement is often far from optimal. Consider the example in Fig. 2(a), where the value of the token on the input enable channel $en_1$ is often 0. For every 0 token on its enable channel, the RECEIVE primitive (denoted by R) generates a dummy token to avoid deadlock in the unconditional logic gates that follow ($F_1$). These dummy tokens may pass through several unconditional primitives before being ignored (for example by a multiplexer) [15]. Suppose $F_1$ is such a primitive whose output is ignored by the next stage $F_2$ when $en_1 = 0$. In this case, the energy consumed in $F_1$ is wasted. This wasted energy is particularly significant in QDI or 1-of-N implementations, in which the return-to-zero nature of the handshaking causes high switching activity even upon repeatedly processing the same token value. In fact, in these implementations, setting the dummy value to 0 or to 1 typically results in similar amounts of switching activity.
An improvement, illustrated in Fig. 2(b), is to move the RECEIVE primitive to the output of $F_1$. In this case, $F_1$ will only operate on useful data, which will be consumed by the RECEIVE primitive when $en_1 = 1$. In particular, when $en_1 = 0$, the RECEIVE primitive in Fig. 2(b) will not communicate any tokens with $F_1$, making it idle, and hence the wasted activity in $F_1$ is avoided.
Similarly, consider the logic block $F_3$ in Fig. 2(a) driving a SEND primitive whose enable ($en_2$) is often 0. In this case, $F_3$ wastes power calculating a value that is often thrown away by the SEND. An improvement, illustrated in Fig. 2(b), is to move the SEND primitive to the inputs of $F_3$.
Moving logic gates through SEND and RECEIVE primitives is reminiscent of retiming [14] in synchronous design in which logic is passed through state-holding elements. Therefore, we use the term reconditioning to denote moving conditional gates through logic gates. In the next section, we prove that reconditioning preserves logical equivalence for a large class of asynchronous circuits called iteration-stall-free (ISF). In this class of circuits, tokens cannot stall at the input of logic gates in between system iterations and the state of the system can be captured by the last token emitted by all token buffers in the circuit, similar to its synchronous counterpart.
Notice that the number of conditional primitives before and after reconditioning can be different and thus reconditioning presents a complex optimization problem that must consider the potential decrease in switching activity and power consumption in the logic gates with the potential increase in the number of conditional primitives.
III. THEORETICAL FOUNDATIONS
The use of three-valued logic to analyze asynchronous circuits for a variety of reasons has a long history (see [17]). In this section, we introduce a three-valued logic (3VL) model for asynchronous circuits that we use to argue that the functionality of reconditioned circuits is preserved. Due to space limitations, the proofs of the given lemmas and theorems have been omitted and can be found in [18].
A. Three Valued Logic Model
We analyze the behavior of an asynchronous system by decomposing its trace of behavior into subtraces called system iterations. In each system iteration, each channel either communicates a data value 0, or 1, or has no communication action. In other words, a channel cannot communicate more than one token per system iteration. In particular, the absence of communication on a channel is modeled by the value N. If there is no communication action during system iteration i on channel c in the asynchronous netlist, the value of variable c in clock cycle i in the three valued (3V) model would be N. This property is illustrated in Fig. 3(b) .
Values 0 and 1 are the two common Boolean logic values, whereas the value N models the condition of no communication action (also known as a spacer [19] ) on a channel of an asynchronous gate during a system iteration, as shown in Fig. 3 .
Basic 3V functions can be defined based on a set of 3V tables as shown in Fig. 4.

Fig. 4. Primitive functions in three-value logic. (a) A ∧ B (AND).
The ∧ and ∨ operators represent logical functions similar to Boolean logical AND and OR functions as long as none of the inputs is N, otherwise the output of these functions is N. The inverting operator ¬ is the same as the Boolean inverter when the input is 0 or 1, but its output is N, when its input is N.
The RECEIVE operator, denoted , behaves like a buffer when enable input E is 1. Whereas when E is 0, the output R is 0 irrespective of the value of L. The SEND operator, denoted , also behaves like a buffer when E is 1. Whereas when E is 0 or N, its output is N.
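As an illustration, the five primitive functions described above can be written out directly. The following Python sketch follows the prose rather than the tables of Fig. 4; the function names are hypothetical, and the RECEIVE output for E = N is not stated in the text, so returning N in that case is an assumption.

```python
N = "N"  # third value: no communication during this system iteration

def and3(a, b):
    """3V AND: Boolean AND on 0/1 inputs, N as soon as any input is N."""
    return N if N in (a, b) else a & b

def or3(a, b):
    """3V OR: Boolean OR on 0/1 inputs, N as soon as any input is N."""
    return N if N in (a, b) else a | b

def not3(a):
    """3V inverter: Boolean NOT on 0/1, N on N."""
    return N if a == N else 1 - a

def receive3(l, e):
    """RECEIVE: buffer when E = 1, constant 0 when E = 0 (whatever L is)."""
    if e == 1:
        return l
    if e == 0:
        return 0
    return N  # assumption: no enable token means no output token

def send3(l, e):
    """SEND: buffer when E = 1, N when E is 0 or N."""
    return l if e == 1 else N

print(and3(1, N), receive3(N, 0), send3(1, 0))
```

Note that `receive3(N, 0)` returns 0 and `send3(1, 0)` returns N: these are exactly the two cases in which the conditional operators differ from the logic operators.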
We use the term unconditional function to distinguish between logic functions and SEND/RECEIVE functions.
Definition 1 (Unconditional Function):
A function f is called unconditional if there exists a representation of f which is obtained only by composition of ∧, ∨, and ¬ functions.
For example,
An unconditional function f has an important property: the output of f is N if and only if at least one of its input variables is N. This can be directly concluded from the tables of Fig. 4 (a) and (b). It is important to observe that RECEIVE and SEND do not have the above property.
Formally, we state the following lemma.
Definition 2 (3V Literals): A 3V literal $x^C$ is a 2V function of the form
$$x^C = \begin{cases} 1, & x \in C \\ 0, & \text{otherwise} \end{cases}$$
where $C \subseteq T$.
Below are some example literals for a 3V variable x
Obviously, for a 3V variable $x$: $x^T = x^{\{0,1,N\}} = 1$. A 3V function $f$ can be written in the form of a sum-of-cubes for each of its values, where each value $f^{\{\gamma_i\}}$, $\gamma_i \in T$, is represented as a Boolean function. Examples of such representations are as follows.
Example 1: For the 3V function RECEIVE in the form r(l, en) = l en, using the table in Fig. 4(d) we can write
Example 2: A 3V multiplexer with data inputs i 0 and i 1 , select input s, and output o can be described as the following unconditional function:
Notice that for $o^{\{0\}}$ and $o^{\{1\}}$, it is required that all input variables have a 0 or 1 value. Otherwise, based on Lemma 1, the value of $o$ would be N.
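The property that an unconditional function outputs N if and only if at least one input is N can be checked exhaustively for small functions. The sketch below does so for a 3V multiplexer; the formula $(\neg s \wedge i_0) \vee (s \wedge i_1)$ is the textbook multiplexer and is assumed here, since the expression of Example 2 is not reproduced. All function names are hypothetical.

```python
from itertools import product

N = "N"
T = (0, 1, N)  # the three values of the model

def and3(a, b): return N if N in (a, b) else a & b
def or3(a, b):  return N if N in (a, b) else a | b
def not3(a):    return N if a == N else 1 - a

def mux3(i0, i1, s):
    """Unconditional multiplexer built only from AND, OR and NOT."""
    return or3(and3(not3(s), i0), and3(s, i1))

def receive3(l, e):
    """RECEIVE as described in the text (E = N -> N is an assumption)."""
    return l if e == 1 else (0 if e == 0 else N)

def n_iff_any_input_n(f, arity):
    """True when f outputs N exactly for the input vectors that contain N."""
    return all((f(*xs) == N) == (N in xs) for xs in product(T, repeat=arity))

print(n_iff_any_input_n(mux3, 3))      # holds for the unconditional mux
print(n_iff_any_input_n(receive3, 2))  # fails: RECEIVE(N, 0) = 0
```

The multiplexer therefore consumes a token on the unselected data input as well, which is why a dummy token is needed there in the asynchronous netlist.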
B. 3V Networks
A 3V network is a directed acyclic graph G(V, E), where V is the set of nodes and E is the set of edges. Each node represents a 3V function with a single 3V output and several 3V inputs. As such, a 3V variable is associated with each node. A 3V network has a set of primary input and a set of primary output nodes. A 3V variable is also associated with each primary input node.
In order to distinguish the type of nodes in the circuit, we define P = {I, R, S, U, O} as a partition of V consisting of I (primary input nodes), R (RECEIVE nodes), S (SEND nodes), U (unconditional nodes), and O (primary output nodes).
There is a directed edge (u, v) ∈ E from the node u to the node v if the function represented by the node v explicitly depends on the output variable at the node u. The node v is called the fanout or the successor of the node u, and the node u is called the fanin or the predecessor of the node v.
The output variable y of a node v ∈ V with input variables
Alternatively, the output of a node can be written as a function of primary inputs.
A path p in a 3V network is a sequence of nodes and edges. We assume that all paths are simple paths, i.e., they contain no node more than once. If a path starts at a node u and ends at a node v, we use the notation $p_{uv}$ or $u \leadsto v$. The transitive fanin (TFI) of a node v is the set of nodes u from which there exists a path $p_{uv}$. The transitive fanout (TFO) of a node v is the set of nodes u to which there exists a path $p_{vu}$. The 3V behavior of an n-input, m-output 3V network is a 3V function $F : T^n \to T^m$. Two networks are equivalent if there is a one-to-one correspondence between their respective primary inputs and primary outputs, and if their corresponding 3V behaviors are equivalent.
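The TFI and TFO sets defined above are plain graph reachability sets. A minimal sketch, assuming the network is given as a list of directed edges between hypothetical node names, is:

```python
def transitive_fanout(edges, v):
    """Nodes reachable from v along at least one edge (v itself excluded in a DAG)."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    seen, stack = set(), [v]
    while stack:
        for w in succ.get(stack.pop(), ()):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def transitive_fanin(edges, v):
    """Nodes from which v is reachable: the TFO of v in the reversed graph."""
    return transitive_fanout([(b, a) for a, b in edges], v)

# Hypothetical network: a, b feed f; f and c feed g.
edges = [("a", "f"), ("b", "f"), ("f", "g"), ("c", "g")]
print(sorted(transitive_fanout(edges, "a")))
print(sorted(transitive_fanin(edges, "g")))
```

Such a routine is what a tool would use to check side conditions of the form "the node producing en is not in the TFO of f", which keep the network acyclic.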
A node representing an unconditional function is called an unconditional node, otherwise it is called a conditional node.
Based on Lemma 1, an unconditional node performs useful calculation only when all of its inputs are simultaneously non-N. We shall call variables whose values are simultaneously N or simultaneously non-N synchronized variables.
Definition 3 (Synchronized Variables):
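In a 3V network, two variables $x_1$ and $x_2$ are synchronized if the following expression evaluates to 1:
$$\left(x_1^{\{N\}} \wedge x_2^{\{N\}}\right) \vee \left(x_1^{\{0,1\}} \wedge x_2^{\{0,1\}}\right).$$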
We denote synchronized variables by writing:
Notice that our definition implies that the synchronization relationship is an equivalence relation.
Lemma 2:
is an unconditional function whose inputs are synchronized:
In this paper, we focus on the power optimization of acyclic asynchronous netlists consisting of only combinational gates. We model these circuits with 3V networks.
C. 3VL Model Limitations and Iteration Stall-Freedom
The 3VL model does not capture all behaviors of an asynchronous circuit. First, it does not model the flow-control nature of asynchronous gates and processes; therefore, it cannot capture behaviors in which tokens are stalled [4] , [18] at the inputs of asynchronous gates across system iterations.
Consider for example the simple asynchronous netlist of Fig. 5, which communicates with environment modules S and R. Fig. 5(a) shows the first iteration of the S module, in which it sends a token to the BUF gate. The BUF gate receives a token from its input and sends it to its output to be received by the AND gate. But the AND gate needs to stall and delay the completion of the receive of the first token from the BUF gate until the S module starts its second iteration, in which it sends another token on channel $C_2$, as shown in Fig. 5(b). Once the AND receives both tokens, it can send a token with value 0 on channel $C_4$. It is important to notice that in order for the AND gate to complete one iteration, S should complete two iterations.
The corresponding 3VL model, however, does not reflect this behavior. Fig. 6(a) and (b) show the first and the second clock cycles in the 3VL model, respectively. In both clock cycles, the value of $C_4$ is N, rather than 0. The reason for this difference is that in an asynchronous netlist, asynchronous gates can create stalls during which tokens can stay on channels. In our example, the AND gate can delay the receive on $C_3$ (and hence stall the BUF gate) until the second token arrives on $C_2$, which does not take place until the second iteration of the S module. During this stall, the BUF keeps its output token on $C_3$.
In the 3VL model, however, unconditional nodes cannot stall other nodes, nor can they keep a value on their outputs if the values of their inputs change. Therefore, the value 1 on $C_3$ in Fig. 6 in the first clock cycle will not be stored and cannot be reused in the second clock cycle.
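The mismatch described above can be reproduced with a two-cycle evaluation of the 3VL model of the BUF/AND example. In this sketch the token values are assumptions chosen to be consistent with the text (1 on the BUF input in the first cycle, 0 on $C_2$ in the second); the function names are hypothetical.

```python
N = "N"

def and3(a, b):
    return N if N in (a, b) else a & b

def buf3(a):
    return a

def cycle(c1, c2):
    """One clock cycle of the 3VL model: BUF drives C3, AND combines C3 and C2 into C4."""
    c3 = buf3(c1)
    c4 = and3(c3, c2)
    return c3, c4

# S sends on C1 in its first iteration and on C2 in its second one.
stalled = [cycle(1, N), cycle(N, 0)]
print(stalled)          # C4 is N in both cycles; the 1 on C3 is not remembered

# If S is constrained to send on both channels in the same iteration:
print(cycle(1, 0))      # C4 = 0, matching the asynchronous netlist
```

The second call corresponds to the constrained environment in which no gate has to hold a token across system iterations.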
The 3VL model is thus a simplified model, and we shall use it only for modeling a subclass of asynchronous systems that have a special property called iteration-stall-freedom.
Intuitively, in an iteration-stall-free asynchronous system, if a module P completes one iteration, any asynchronous gate A in the system either does not start its iteration at all, or if it starts its iteration, it can complete the iteration without being stalled until P's next iteration. In other words, A can complete the receive of all of its input tokens within one iteration of P. Therefore, no token needs to be kept on a channel from the first iteration of P to the second iteration of P. That is to say, the completion of one iteration of a module does not require the completion of more than one iteration of other modules, and hence we can divide the system's behavior into iterations in which every module either does not start its iteration or it can start and complete exactly one iteration.
In this paper, we assume that the environment is constrained, such that the asynchronous system defined by the asynchronous netlist and the environment is iteration-stall-free. In our example, if S was constrained such that it would send a token on both channels in each iteration, BUF and AND would not have been required to stall until S's second iteration. In this way, iteration-stall-freedom is similar to the burst-mode circuit requirement that inputs arrive in bursts [20] . In practice, a large class of asynchronous circuits and their environments are iteration-stall-free. For example, it is common to constrain the environment to provide all of the inputs in only one iteration, and also receive all of the outputs in only one iteration.
It is worth mentioning that under such a constraint, the communication actions on $C_1$ and $C_2$ are called synchronized communication actions [21], and variables $C_1$ and $C_2$ in the 3VL model are synchronized variables (Definition 3). Therefore, in an iteration-stall-free 3VL model, certain variables are constrained to be synchronized. In the following, we formally define an iteration-stall-free 3V network. An asynchronous netlist is iteration-stall-free if its corresponding 3VL model is iteration-stall-free. Using the 3VL model, we can now define a notion of equivalence for two asynchronous netlists.
Definition 5 (Logical Equivalence): Two acyclic iteration-stall-free asynchronous netlists with combinational circuits are logically equivalent iff their corresponding 3V networks are equivalent.
In the next sections, we show that our proposed power-optimization modification methods, namely conditioning and reconditioning, preserve logical equivalence and iteration-stall-freedom.
D. SEND Reconditioning
An important property of the operator is its right distributivity on unconditional functions. If we start with the network on the left-hand side of Fig. 7 and move a SEND node from the output of a node to its inputs [ Fig. 7(right) ], when en {0} , we get y
, and since N represents no communication, depending on how often en {0} , this may lead to less switching activity in f in the corresponding asynchronous netlist. On the other hand, if we start from the network on the right-hand side and en is often 1, it makes sense to move multiple SEND nodes to a single SEND at the output of f and save area and switching activity in SEND nodes.
Notice that for this transformation to be correct, not only should node f have a SEND on all of its inputs in the right-hand side network, but the enable variables of all SEND nodes should also be equal. Also, in order to keep the network acyclic, we should be careful that such a transformation does not create a cycle. Therefore, we require the node with output en not to be in the TFO of f.
Lemma 3 (SEND Reconditioning): Let node u represent an unconditional function y = f (x) = f (x 1 , x 2 , . . . , x n ) and let en be an output variable of a node v. Then
The result of Lemma 3 is that if a node u of 3VL implements the function y 1 = f (x) en, replacing u with the node u c implementing the function y 2 = f (x 1 en, x 2 en, . . . , x n en) preserves equivalence. On the other hand, the reverse transformation states that if a node u c of 3VL implements the function y 2 = f (x 1 en, x 2 en, . . . , x n en), replacing u c with the node u implementing the function y 1 = f (x) en preserves equivalence. The proof that SEND reconditioning preserves ISF is given in [18] .
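For a small function, the equivalence stated by Lemma 3 can be checked by enumerating all 3V input and enable values. The sketch below is not a proof and uses a hypothetical two-input function; it only confirms that applying SEND at the output of f and applying SEND with the same enable on every input of f give the same 3V value.

```python
from itertools import product

N = "N"
T = (0, 1, N)

def and3(a, b): return N if N in (a, b) else a & b
def not3(a):    return N if a == N else 1 - a
def send3(l, e): return l if e == 1 else N

def f(x1, x2):
    """Example unconditional function: x1 AND (NOT x2)."""
    return and3(x1, not3(x2))

def send_moves_through(f, arity):
    """Compare SEND at the output of f with SENDs on all inputs of f."""
    return all(
        send3(f(*xs), en) == f(*(send3(x, en) for x in xs))
        for xs in product(T, repeat=arity)
        for en in T
    )

print(send_moves_through(f, 2))
```

When the enable is 0 or N, both sides evaluate to N, which is the case in which moving the SEND to the inputs removes the activity of f.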
E. RECEIVE Reconditioning
In general, is not right-distributive on unconditional functions. For example, consider f (x 1 , x 2 ) = ¬x 1 ∧ ¬x 2 , as shown in Fig. 8 . It is easy to verify that when en {0}
The difference is because in the table of Fig. 4(d) , we defined ∀x : x 0 = 0. We define a new function RECEIVE 1 denoted by 1 , such that ∀x : x 1 0 = 1 as shown in Fig. 9 . Now, we can write
Equation (2) states that if we start from the left network in Fig. 8 , in order to get an equivalent network, we need to replace the RECEIVE node by RECEIVE 1 node as it is being moved from the output of f to its inputs. Similarly, (3) states that if we start from the right network in Fig. 8 , in order to get to an equivalent network, we need to replace the RECEIVE nodes by RECEIVE 1 nodes as it is being moved from the inputs of f to its output.
This problem is analogous to the register initialization problem after reverse retiming in synchronous circuits [22] . In general, we can state the following.
Lemma 4 (RECEIVE Reconditioning): Let node $u$ represent an unconditional function $y = f(x) = f(x_1, x_2, \ldots, x_n)$ and let $en$ be an output variable of a node $v$ such that $v \notin \mathrm{TFO}(u)$. Then

The result of Lemma 4 is that if a node $u$ in a 3V network implements the function $y_1 = f(x) * en$, replacing $u$ with the node $u_c$ implementing the function $y_2 = f(x_1 *_1 en, x_2 *_2 en, \ldots, x_n *_n en)$, and vice versa, preserves equivalence (assuming we independently replace $*$ and each $*_i$ with RECEIVE or RECEIVE$_1$ such that the dummy value produced at the output of $f$ equals $f$ evaluated on the dummy values produced at its inputs).
The application of Lemma 4 is shown in Fig. 10. Notice that for the right-to-left transformation, one only needs to find the value of $f(0, \ldots, 0)$. If this value is 0, a RECEIVE should be placed at the output of $f$; otherwise, a RECEIVE$_1$. For the left-to-right transformation, however, to preserve equivalence one needs to find the values for $x_1, x_2, \ldots, x_n$ such that $f(x) = 0$, which may not have a unique answer. Moreover, the generalized application of this style of left-to-right transformation, in which multiple RECEIVEs are pushed backward through a block of logic, may lead to a case where there is no feasible assignment of inputs to RECEIVE or RECEIVE$_1$ primitives that preserves logical equivalence to the original network when $en = 0$. Fortunately, in many applications the output value of the initial RECEIVEs when $en = 0$ is a don't care to the downstream logic [15], and thus maintaining this local logical consistency for such applications is not required to preserve logical equivalence of the larger circuit. The proof that RECEIVE reconditioning preserves ISF is given in [18].
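The right-to-left case can be illustrated on the function $f(x_1, x_2) = \neg x_1 \wedge \neg x_2$ used earlier in this section. In the sketch below, RECEIVE and RECEIVE$_1$ are modeled by one hypothetical function with a `dummy` parameter; the output N for an enable of N is an assumption, as the text does not state it.

```python
from itertools import product

N = "N"
T = (0, 1, N)

def and3(a, b): return N if N in (a, b) else a & b
def not3(a):    return N if a == N else 1 - a

def receive3(l, e, dummy=0):
    """dummy=0 models RECEIVE, dummy=1 models RECEIVE_1."""
    if e == 1:
        return l
    if e == 0:
        return dummy
    return N  # assumption: no enable token, no output token

def f(x1, x2):
    return and3(not3(x1), not3(x2))

def output_dummy(f, arity):
    """Dummy value needed at the output when plain RECEIVEs sit on all inputs."""
    return f(*([0] * arity))

def equivalent(f, arity, out_dummy):
    """Compare one receive at the output of f with plain RECEIVEs on every input."""
    return all(
        receive3(f(*xs), en, out_dummy) == f(*(receive3(x, en) for x in xs))
        for xs in product(T, repeat=arity)
        for en in T
    )

d = output_dummy(f, 2)
print(d, equivalent(f, 2, d), equivalent(f, 2, 0))
```

Since $f(0, 0) = 1$ here, only the RECEIVE$_1$ choice at the output is equivalent, which mirrors the register initialization issue in reverse retiming.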
IV. PROBLEM FORMULATION

A. Problem Definition
In [12], we formulated the reconditioning problem to minimize switching activity. In this extended version, we minimize power consumption, considering both dynamic and static sources. Let $S_v$ be the static power of a gate/primitive $v$ and $A_v$ be the switching power of $v$ when it is active. For logic gates, $A_v$ represents the amount of switching power necessary for communicating on all input channels, calculating the output value, and communicating on the output channel. For conditional primitives, $A_v$ represents the amount of switching power necessary for communicating on the enable channel and completing one iteration of the forever loop in Fig. 1.
B. Operational Factors
In order to capture how often an unconditional gate u is active, we assign a value called the operational factor 0 ≤ O u ≤ 1 to u, which is the fraction of the time that u is operating, i.e., it receives a value on all inputs and sends a value to its output.
We consider three cases for the enable of a conditional primitive $c$: receiving a token with value 0, receiving a token with value 1, and receiving no token. We denote by $p^{\{1\}}_{en_c}$ the probability of receiving an enable with value 1, by $p^{\{0,1\}}_{en_c}$ the probability of receiving an enable with value 0 or 1, and by $1 - p^{\{0,1\}}_{en_c}$ the probability of receiving no value on the enable. These probabilities can be obtained through simulation. A conditional primitive $c$ is active each time it receives a value on its unconditional (i.e., enable) input. Therefore, the operational factor of a conditional primitive is defined as $O_c = p^{\{0,1\}}_{en_c}$.
C. Objective Function
Once the operational factors are calculated, we minimize the following:
where U is the set of unconditional gates and R and S are the set of RECEIVE and SEND primitives, respectively. The reconditioning problem is rearranging RECEIVE and SEND primitives such that P Total as defined in (4) is minimized while equivalence and iteration-stall-freedom are preserved as proven in Section III.
Note that reconditioning does not change the operational factor $O_c$ of a conditional primitive $c$, since $c$ will not be moved past other conditional primitives. The model excludes the static power of unconditional gates because the number and type of unconditional gates do not change during reconditioning and thus their static power consumption should remain roughly constant. Also note that this model does not capture the change in dynamic power consumption of combinational gates due to reconditioning (see footnote 2).
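To make the cost model concrete, the sketch below evaluates one plausible reading of the objective: activity-weighted switching power for every unconditional gate, plus activity-weighted switching power and static power for every conditional primitive. The exact form of (4) is not reproduced here, so this formula, the dictionary keys, and the numbers are all assumptions for illustration.

```python
def total_power(unconditional, conditional):
    """Assumed cost model:
    sum over gates u of O_u * A_u, plus sum over primitives c of (O_c * A_c + S_c)."""
    logic_dynamic = sum(g["O"] * g["A"] for g in unconditional)
    conditional_power = sum(c["O"] * c["A"] + c["S"] for c in conditional)
    return logic_dynamic + conditional_power

# Hypothetical netlist: one gate active half of the time, one always-active RECEIVE.
gates = [{"O": 0.5, "A": 4.0}]
prims = [{"O": 1.0, "A": 1.0, "S": 0.25}]
print(total_power(gates, prims))
```

Under this reading, reconditioning trades a lower $O_u$ on some gates against the $A_c$ and $S_c$ of any additional conditional primitives it creates.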
D. Model
We limit the scope of this paper by only allowing conditional primitives to move across logic, but not state-holding elements. This way we can avoid modifying the initial conditions for flip-flops, which simplifies the use of commercial formal equivalence checking tools that can prove the pre- and post-reconditioning circuits are equivalent (see [23]), as described in Section III. This also allows us to model the netlist using an acyclic reconditioning network $G = (U, E)$ created by cutting the original circuit at flip-flops, treating the input of each flop as a primary output and the output of each flop as a primary input. The set $U$ consists of all the unconditional logic gates, to which we add primary inputs ($I$) and primary outputs ($O$) for mathematical convenience.

Footnote 2: Indeed, the probability of different input combinations to combinational gates may change due to reconditioning. However, transistor-level SPICE analysis of several typical QDI logic gates implemented in the precharged half buffer template in a 65 nm technology confirmed that the difference in power consumption across different input combinations is less than 10.8%.
1) Each edge $e = (u, v) \in E$ is weighted with a conditional primitive counter $W(e) \in \mathbb{Z}_{\ge 0}$, which specifies how many conditional primitives originally existed between gates $u$ and $v$.
2) For each path $u \xrightarrow{p_j} v$ between gates $u$ and $v$ in the reconditioning network, where $j$ denotes the $j$th path, we define the path weight to be the number of conditional primitives that originally existed along that path: $W(p_j) = \sum_{e \in p_j} W(e)$.
3) Each unconditional gate $u \in U$ is weighted with its distance $D_u \in \mathbb{Z}_{\ge 0}$, specifying the maximum path weight among all the paths that start from a primary input and end at $u$. The distance of primary inputs is set to zero.
4) To each edge $e = (u, v) \in E$, we assign an initial possible conditional vector PCV($e$), where each item in PCV($e$) is a three-tuple: a) the type (RECEIVE or SEND); b) the enable source; and c) the estimated probability of the enable being a logic 1. This initial PCV represents the path consisting of conditional primitives that initially existed between gates $u$ and $v$.

Fig. 11(a) shows an example asynchronous netlist (the sources of enables are not shown). Fig. 11(b) shows the corresponding reconditioning network, where the conditional primitives are removed, the initial distances of gates are annotated by D, and the initial weights of edges are annotated by W. Notice that by removing conditional primitives, paths such as $u_3 \to R_3 \to S_1 \to u_4$ in the original circuit are contracted to an edge $(u_3, u_4)$ in the reconditioning network. These new edges are called purely conditional edges.
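The distances defined in item 3) are longest-path values in a weighted DAG and can be computed in one pass over a topological order. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the small network and its edge weights are hypothetical and are not those of Fig. 11.

```python
from graphlib import TopologicalSorter

def distances(weights):
    """weights maps each edge (u, v) to W(e); returns D for every node.
    Nodes without predecessors (primary inputs) get distance 0."""
    preds = {}
    for (u, v) in weights:
        preds.setdefault(v, []).append(u)
        preds.setdefault(u, [])
    D = {}
    for v in TopologicalSorter(preds).static_order():  # predecessors come first
        D[v] = max((D[u] + weights[(u, v)] for u in preds[v]), default=0)
    return D

# Hypothetical reconditioning network with two primary inputs and one output.
W = {("i1", "u1"): 1, ("u1", "u2"): 0, ("i2", "u2"): 2, ("u2", "o"): 1}
print(distances(W))
```

Here `u2` gets distance 2 because the path from `i2` carries two conditional primitives, even though the path through `u1` carries only one.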
V. POSSIBLE CONDITIONAL VECTORS
The valid solution space of reconditioning can be characterized using possible conditional vectors (PCVs) for edges of a reconditioning network that effectively identify how far each RECEIVE and SEND primitive can be moved while preserving equivalence and iteration-stall-freedom. To compute the PCVs systematically, we define the following vector operations. to K n end with. 3
The vector that has all members of K i in sequence, starting with i = 1 and ending with i = n. Example 3:
a, c, d).
We define the following operations to grow initial PCVs of outgoing (incoming) and incoming (outgoing) edges of gates. 
. , O n ).
We define PUSH($u$) to create the following new PCVs on the $m$ outgoing edges:

CAT($T$, $O_1$), . . . , CAT($T$, $O_m$).

Similarly, PULL($u$) is defined to create the following new PCVs on the $n$ incoming edges:

CAT($I_1$, $H$), . . . , CAT($I_n$, $H$).

Fig. 12 shows an example of PUSH and PULL operations on a gate $u_3$. Given the PCVs of the incoming edges of an unconditional gate $u$, the PUSH operation creates a longer PCV for each outgoing edge. Similarly, given the PCVs of the outgoing edges of an unconditional gate $u$, the PULL operation creates a longer PCV for each incoming edge. The formal proof that a single PUSH/PULL operation preserves logical equivalence is in Section III. Notice that each PUSH/PULL operation on $u$ changes its distance by exactly 1. The operational factor of $u$ at this new distance is calculated by the method described in Section IV-B.
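The two operations can be sketched on plain Python lists. The definitions of $T$ and $H$ are not reproduced above, so this sketch assumes that $T$ is the longest common tail of the incoming PCVs and $H$ the longest common head of the outgoing PCVs; PCV items are shown as short strings rather than the three-tuples of the paper, and all function names are hypothetical.

```python
def cat(*vectors):
    """All members of the given vectors in sequence."""
    return [item for v in vectors for item in v]

def common_head(vectors):
    """Longest vector that all given vectors start with."""
    head = []
    for items in zip(*vectors):
        if any(it != items[0] for it in items):
            break
        head.append(items[0])
    return head

def common_tail(vectors):
    """Longest vector that all given vectors end with."""
    return common_head([v[::-1] for v in vectors])[::-1]

def push(incoming, outgoing):
    """New PCVs for the outgoing edges: CAT(T, O_i) for each outgoing PCV O_i."""
    T = common_tail(incoming)
    return [cat(T, O) for O in outgoing]

def pull(incoming, outgoing):
    """New PCVs for the incoming edges: CAT(I_i, H) for each incoming PCV I_i."""
    H = common_head(outgoing)
    return [cat(I, H) for I in incoming]

print(push([["a", "c", "d"], ["b", "c", "d"]], [["e"], []]))
print(pull([["a"], ["b"]], [["s", "x"], ["s"]]))
```

Only primitives shared by every incoming (or every outgoing) edge can cross the gate, which matches the requirement that all inputs of f carry a conditional primitive with the same enable.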
A. Finding the Longest PCV
We use successive PUSH/PULL operations to grow the PCV$_e$ vectors for all the edges $e$ in the reconditioning network, as described in Algorithm 1. First, we assign an initial vector PCV$_e$ to each edge $e \in E$, as described in Section IV-D. We then sort the gates in ascending topological order, since we would like to perform a PUSH operation on a gate only after PUSH is performed on all of its fanins. Subsequently, we perform a PULL operation on the incoming edges of a gate in descending topological order, such that a PULL operation on a gate is performed only after a PULL is performed on all its fanouts. The complexity of the PUSH/PULL algorithm is O(m + n). For each unconditional gate $u$, the PUSH/PULL algorithm finds two constants, $L_u$ and $U_u$, which are the lower and upper bounds on the distance of the unconditional gate.

Algorithm 1

    procedure SUCCESSIVEPUSH
        F ← list of gates u ∈ U in ascending topological order
        S ← {}
        while F is not empty do
            Remove node u from F's front and add it to S
            Update PCV_e of u's outgoing edges e using PUSH(u)
        end while
    end procedure

    procedure SUCCESSIVEPULL
        B ← list of nodes u ∈ U in descending topological order
        S ← {}
        while B is not empty do
            Remove a node u from B's front and add it to S
            Update PCV_e of u's incoming edges e using PULL(u)
        end while
    end procedure

    procedure FINDLONGESTPCV
        for all e ∈ E do
            if e is a purely conditional edge then
                PCV_e ← the PCV representing the contracted path
VI. ILP FOR RECONDITIONING
Inspired by the linear programming solution for the retiming problem [14] , we formulate the reconditioning problem as an ILP. First, we define the following parameters and variables.
Known constants are denoted by capital letters, whereas unknown variables are denoted by lowercase letters.
Parameters: The parameters are as follows.
1) $D_u$, u ∈ U: The initial distance of each unconditional gate. This value is calculated as explained in Section IV-D.
2) $L = \max_{u \in U}(D_u \ldots$
3) $L_u$, $U_u$, u ∈ U: The lower and upper bounds on the distance of unconditional gate u. Since we do not perform reconditioning through primary inputs and primary outputs, we set these bounds for them to be equal to their initial distances.
4) $O_{ul}$, u ∈ U, l ∈ {0, 1, . . . , L}: The operational factor of an unconditional gate u if its distance is l.
5) $W_e$, e ∈ E: The initial number of conditional primitives on edge e.
6) $A_n$: The switching power of n when active, where n is either an unconditional gate or a conditional primitive.
The averaged dynamic power of an unconditional gate u if its distance is l.
2) $r_u \in \mathbb{Z}$, u ∈ U: The displacement of an unconditional gate u, i.e., the difference between its final distance and its initial distance.
Based on the operational factors $O_{ul}$ and the distance l ∈ {0, 1, . . . , L} of each unconditional gate u ∈ U, we can calculate its averaged dynamic power as
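Note that the number of conditional primitives on an edge e = (u, v) after reconditioning can be calculated as $w_e = r_v - r_u + W_e$. The total number of conditional primitives after reconditioning is thus given by
$$\sum_{e \in E} w_e = \sum_{e \in E} W_e + \sum_{(u,v) \in E} (r_v - r_u) = \sum_{e \in E} W_e + \sum_{u \in U} r_u \left( |FI(u)| - |FO(u)| \right)$$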
where FI(u) is the set of incoming edges of unconditional gate u, and FO(u) is the set of outgoing edges of u [14]. Let $C_u = |FI(u)| - |FO(u)|$. We then get
$$\sum_{e \in E} w_e = \sum_{e \in E} W_e + \sum_{u \in U} C_u r_u. \tag{6}$$
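As a quick numerical check of this counting argument, the following Python sketch compares the edge-by-edge sum of $w_e = r_v - r_u + W_e$ with the closed form that uses $C_u = |FI(u)| - |FO(u)|$. The function names and the three-edge example graph are hypothetical and chosen only for illustration.

```python
def total_primitives_edgewise(edges, W, r):
    """Sum w_e = r_v - r_u + W_e over all edges e = (u, v)."""
    return sum(r[v] - r[u] + W[(u, v)] for (u, v) in edges)


def total_primitives_closed_form(edges, W, r):
    """Sum of W_e plus the sum of C_u * r_u, with C_u = |FI(u)| - |FO(u)|."""
    C = {g: 0 for g in r}
    for (u, v) in edges:
        C[v] += 1  # the edge is an incoming edge of v
        C[u] -= 1  # the edge is an outgoing edge of u
    return sum(W[e] for e in edges) + sum(C[g] * r[g] for g in r)


# Two-input gate "c": one conditional primitive on each input edge initially.
edges = [("a", "c"), ("b", "c"), ("c", "d")]
W = {("a", "c"): 1, ("b", "c"): 1, ("c", "d"): 0}
r = {"a": 0, "b": 0, "c": -1, "d": 0}  # displace gate c by -1

print(total_primitives_edgewise(edges, W, r))
print(total_primitives_closed_form(edges, W, r))
```

In this example gate c has two incoming edges and one outgoing edge, so $C_c = 1$ and a displacement of −1 reduces the number of conditional primitives from two to one in both computations.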
Because the first term in (6) is a constant, we include only the second term in the ILP objective function.
Fig. 13. Multiple reconditioning results of the sample circuit in Fig. 11(a).
Moreover, using the distance and $C_u$ of each unconditional cell u ∈ U, the probability of its enable being active $p_e^{\{0,1\}}$, and the active switching power $A_c$ of each conditional primitive c ∈ R ∪ S being moved, we can calculate the averaged difference in dynamic power due to moving conditional primitives across an unconditional gate u ∈ U. Table I shows $P_{ul}$ for the sample circuit illustrated in Fig. 11(a). For each unconditional gate with distance $l = D_u$, $P_{ul}$ is 0 since no conditional primitives are moved. If the number of incoming edges of an unconditional gate u ∈ U is the same as its number of outgoing edges, then $P_{ul} = 0$ for all l ∈ {0, 1, . . . , L}, since the number of conditional primitives does not change. Fig. 13 illustrates some possible reconditioning results of the sample circuit in Fig. 11(a), and Table II shows the dynamic power of the conditional primitives and how it relates to $P_{ul}$.
Using the above variables and parameters as well as (4), we formulate the ILP problem as follows:
such that:
Note that the third part of (4) is the static power of unconditional gates, which is a constant; hence, we exclude it from the objective function. Constraint (8) forces the distance of each gate to be a number between 0 and L. Constraint (9) ensures that the number of conditional primitives on each edge in the reconditioning network is non-negative after reconditioning. Constraint (10) states the relationship between the initial distance $D_u$ of each gate, its displacement $r_u$, and its final distance $d_u$. The lower and upper bound constraints on distances are enforced by constraint (11).
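Read literally from the description above, the four constraints can be sketched as follows. This is a reconstruction from the prose only, using the symbols $d_u$, $r_u$, $D_u$, $W_e$, $L_u$, $U_u$, and $L$ defined in this section; the exact form of the original constraints, any auxiliary variables they use, and the objective function (7) are not reproduced here.

```latex
\begin{align}
0 \le d_u \le L, \quad & \forall u \in U \tag{8}\\
r_v - r_u + W_e \ge 0, \quad & \forall e = (u, v) \in E \tag{9}\\
d_u = D_u + r_u, \quad & \forall u \in U \tag{10}\\
L_u \le d_u \le U_u, \quad & \forall u \in U \tag{11}
\end{align}
```

The left-hand side of (9) is the edge weight $w_e = r_v - r_u + W_e$, so (9) simply requires that no edge carries a negative number of conditional primitives.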
Sharing of registers on fanout edges of logic gates has been addressed in retiming [14]; however, formulating the sharing of conditional cells in the reconditioning ILP is not as straightforward, since the type of the conditional cells (i.e., SEND/RECEIVE) at the fanout of a logic gate and their enable values may be different. Therefore, rather than formulating sharing in the ILP, we perform a post-processing step: after solving the ILP, for each unconditional gate u, if there are conditional primitives of the same type with the same enable value on a subset of the fanout edges of u, we combine the conditional primitives of those edges, reducing the number of SENDs and RECEIVEs.
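The grouping step of this post-processing pass can be sketched in Python as follows. The data layout is an assumption made for illustration: `fanout_primitives` maps each fanout edge of one gate u to a hypothetical `(kind, enable)` pair describing the conditional primitive on that edge, or to `None` if the edge has none. The function name is also hypothetical.

```python
from collections import defaultdict


def shareable_groups(fanout_primitives):
    """Group the fanout edges of one gate by (primitive type, enable signal).

    Returns only the groups with more than one edge, i.e., the edges whose
    conditional primitives could be merged into a single shared primitive.
    """
    groups = defaultdict(list)
    for edge, primitive in fanout_primitives.items():
        if primitive is not None:
            groups[primitive].append(edge)
    return {prim: sorted(es) for prim, es in groups.items() if len(es) > 1}


# Example: two RECEIVEs with the same enable can be shared; the SEND cannot.
fanout = {
    "e1": ("RECEIVE", "en1"),
    "e2": ("RECEIVE", "en1"),
    "e3": ("SEND", "en1"),
    "e4": None,
}
for (kind, enable), shared_edges in shareable_groups(fanout).items():
    saved = len(shared_edges) - 1
    print(kind, enable, shared_edges, "primitives saved:", saved)
```

Keying the groups on both the type and the enable reflects the condition stated above: primitives are combined only when both match.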
In addition, note that the objective function does not model the impact of reconditioning on the performance of the circuit. In fact, reconditioning does not change the number of conditional cells along a path, and paths that go only through the input and output of conditional cells do not change in latency. However, paths through the enable input of conditional cells may get shorter or longer, and thus cycles that involve such paths may become globally critical, slowing down the entire circuit. Performance-aware reconditioning is left as future work.
A. Limitations of the ILP Formulation
Compared to the retiming formulation in [14], our objective function in (7) has two extra terms (the first two terms), which prevent us from mapping it to the minimum-cost circulation problem with a polynomial-runtime solution [14]. Although ILPs are known to be intractable for large circuits, since we do not move RECEIVE/SEND primitives through state-holding elements, as described in Section IV-D, the circuit could be partitioned into pipeline stages that are each modeled with a directed acyclic graph (DAG), and the reconditioning problem can be solved for each partition independently and concurrently. This would reduce the size of the ILP significantly without losing optimality. Although some research has shown that some forms of retiming on DAGs can be optimally computed in polynomial time (e.g., the clock minimization problem), it is still unclear whether this can be done with the more general form of retiming that we are exploring (based on the register minimization problem) [24].
Our experimental results show that the ILP problem can be solved in reasonable time for medium-size networks. As the size of the network grows beyond about 40 000 gates, the ILP becomes intractable.
VII. GREEDY APPROACH FOR RECONDITIONING
As an alternative to the ILP, we present a fast and efficient heuristic algorithm that uses the same constraints that the ILP problem adheres to, but moves one (or a few) conditional primitives at a time. In particular, for an unconditional gate u ∈ U with current distance $d_u$, let $w_{i_1}, w_{i_2}, \ldots, w_{i_n}$ be the current weights of its incoming edges, and $w_{o_1}, w_{o_2}, \ldots, w_{o_m}$ be the current weights of its outgoing edges. We represent a move m by a pair m = (u, r), where $r \in \mathbb{Z}$. Such a move m = (u, r) is legal if
Fig. 14 shows the result of committing a move. Notice that r can be either positive or negative.
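A legality check and commit step for a move (u, r) can be sketched in Python as follows. This is an assumption-laden sketch rather than the paper's exact conditions: it is inferred from the ILP constraints of Section VI, with the sign convention taken from $w_e = r_v - r_u + W_e$, so that a displacement r adds r to every incoming edge weight of u and subtracts r from every outgoing edge weight. The function names are hypothetical.

```python
def is_legal_move(d_u, r, lower, upper, w_in, w_out):
    """Check a move (u, r) against the distance bounds and edge weights of u.

    d_u: current distance of u; lower/upper: its bounds L_u and U_u;
    w_in/w_out: current weights of the incoming/outgoing edges of u.
    """
    if not (lower <= d_u + r <= upper):
        return False                      # new distance leaves [L_u, U_u]
    if any(w + r < 0 for w in w_in):
        return False                      # an incoming edge would go negative
    if any(w - r < 0 for w in w_out):
        return False                      # an outgoing edge would go negative
    return True


def commit_move(d_u, r, w_in, w_out):
    """Return the updated distance and edge weights after committing (u, r)."""
    return d_u + r, [w + r for w in w_in], [w - r for w in w_out]


# Gate with one primitive on each of two inputs and none on its output.
print(is_legal_move(2, -1, 0, 3, [1, 1], [0]))   # move both primitives forward
print(is_legal_move(2, -2, 0, 3, [1, 1], [0]))   # not enough primitives
print(commit_move(2, -1, [1, 1], [0]))
```

As noted above, r may be positive or negative; the same two functions cover both directions.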
The power improvement of a legal move m = (u, r) can be calculated as
It is important to observe that committing a move m = (u, r) only affects the moves associated with u and its neighboring gates, i.e., gates in $N_u = \{v \in U \mid (u, v) \in E \lor (v, u) \in E\}$. Any other move stays valid, and the power improvement associated with it does not change. Consider the example in Fig. 15, in which we commit the move $(u_3, -1)$. As a result, the moves
, and $(u_3, -3)$ become invalid and the new moves $(u_3, -1)$, $(u_3, -2)$, and $(u_4, +1)$ should now be considered as candidate moves, as shown in Fig. 15(b). Notice that the power improvements of the moves $(u_3, -1)$ and $(u_3, -2)$ before and after committing the move $(u_3, -1)$ may be different, and hence we need to reevaluate them.
The greedy algorithm is presented in Algorithm 2. We first initialize the moves, maintaining a list of all possible legal moves sorted in nonincreasing order with respect to $p_m$. By limiting the maximum displacement of moves through the set Disp, we can control the size of moveList. We greedily and repeatedly pick the move m = (u, r) with the highest $p_m$ and commit it by updating the distance of u and the weights of u's incoming and outgoing edges using (14)-(16). After committing a move, the moves for the gates $v \in \{u\} \cup N_u$ that become invalid, i.e., violate constraints (14)-(16), are removed from the move list. Also note that, as a result of committing a move, we add the moves of the neighboring gates of u that become legal.
Let n = |U|, l be the number of possible displacement values, $D_{max} = \max_{u \in U}(\mathrm{Degree}(u))$, and $R_{max} = \max_{u \in U}(U_u - L_u)$, where $\mathrm{Degree}(u) = |N_u|$ is the number of edges at u. The maximum size of the move list M is then nl. The complexity of the greedy algorithm is $O(D_{max} R_{max}\, nl \log(nl))$.
Finally, as in the ILP approach, once the greedy algorithm terminates, we traverse the netlist to share SENDs/RECEIVEs when possible.
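The overall greedy loop can be illustrated with the simplified Python sketch below. Several things in it are assumptions made for the sake of a small runnable example and are not taken from the paper: the cost model counts only the operational factor of each gate (unit active power) plus a fixed static cost per conditional primitive, a move's improvement is defined as the resulting cost reduction (positive is better), and all candidate moves are rescanned in every iteration instead of maintaining the incrementally updated sorted move list. All names are hypothetical.

```python
def total_cost(d, w, op_factor, static_cost):
    """Sum of operational factors plus a static cost per conditional primitive."""
    dynamic = sum(op_factor[u][d[u]] for u in d)
    return dynamic + static_cost * sum(w.values())


def apply_move(d, w, u, r):
    """Return copies of the distances and edge weights after the move (u, r)."""
    d2, w2 = dict(d), dict(w)
    d2[u] += r
    for (a, b) in w:
        if b == u:
            w2[(a, b)] += r  # incoming edge of u
        if a == u:
            w2[(a, b)] -= r  # outgoing edge of u
    return d2, w2


def greedy_recondition(d, w, bounds, op_factor, static_cost=0.5, disp=(-1, 1)):
    """Repeatedly commit the legal move with the largest positive improvement.

    d: gate -> distance; w: (u, v) -> number of primitives on the edge;
    bounds: gate -> (L_u, U_u); op_factor: gate -> {distance: factor}.
    """
    d, w = dict(d), dict(w)
    while True:
        base = total_cost(d, w, op_factor, static_cost)
        best = None
        for u in d:
            low, high = bounds[u]
            for r in disp:
                if not low <= d[u] + r <= high:
                    continue                      # distance bound violated
                d2, w2 = apply_move(d, w, u, r)
                if any(x < 0 for x in w2.values()):
                    continue                      # negative edge weight
                gain = base - total_cost(d2, w2, op_factor, static_cost)
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, d2, w2)
        if best is None:
            return d, w                           # no improving move is left
        _, d, w = best
```

Every committed move strictly lowers the cost and the distances are bounded, so the loop terminates; like any greedy scheme, it can stop in a local minimum.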
VIII. EXPERIMENTAL RESULTS
Algorithm 2
    procedure ADDMOVES(moveList, U)
        for all u ∈ U, r ∈ Disp do
            if move m = (u, r) is legal then
                Let p_m ← the change in power consumption
                if p_m < 0 AND (u, r) is not in moveList then
                    Insert (u, r) into moveList
                    Keep moveList in non-increasing order of p_m
                end if
            end if
        end for
        return moveList
    end procedure

    procedure GREEDYRECON
        M ← empty
        ADDMOVES(M, U)
        while M is not empty do
            Remove & commit move m = (u, r_u) from M's head
            ...
            Remove invalid moves involving v ∈ N_u from M
            ADDMOVES(M, N_u)
        end while
    end procedure
Our experiments use a TSMC 65 nm QDI library described in [15], whose front-end views were provided for academic research. Unfortunately, this library is not characterized for power, which limits our ability to use realistic power characterizations. Thus, even though the theory supports a more realistic model of power, we adopted a simpler unit dynamic power model for our experiments, in which we assume that $A_v = 1$ for every conditional or unconditional gate v. Moreover, for the static power $S_c$ of the conditional primitives, we assume that SEND and RECEIVE consume 0.5 units of power. This places a significant cost on any reconditioning that increases the number of conditional cells. Finally, we estimate the static power of every unconditional gate to be the static power of a RECEIVE multiplied by the ratio of their areas.
For the greedy algorithm, we tested a two-step approach in which we initially run the algorithm with $S_c = 0$ and $A_c = 0$ to move the conditional primitives freely without falling into a local minimum, and then rerun the algorithm with $S_c = 1$ and $A_c = 1$ to optimize the placement of conditional primitives. We also explored running a greedy peephole optimization [15] in between the two passes to remove redundant back-to-back conditional primitives. We performed a complete case study of both algorithms on the following pipelined arithmetic circuits.
1) ADDMULT: An adder/multiplier dual-mode arithmetic unit. Using conditional communication, in each mode data is only sent to the subunit that is performing useful calculation.
2) CONDMULT: A multiplier with enable. This represents circuits that conditionally get activated, receive data, compute, and send the result to the output.
3) LCM: A block that calculates the least common multiple of two inputs using Euclid's iterative algorithm. This block has cycles and state-holding elements.
4) ALU: A four-mode ALU with ADD, SUB, MULT, and AND submodules, in which conditional communication was carefully added such that data is only sent to the submodule performing useful operation.
To keep our analysis simple, we chose examples in which the conditionality yields only two distinct operational factors: 1 and α, where 0 < α < 1, as shown in Table III. For each of these examples, we sweep α over the values 0.25, 0.50, and 0.75. In addition, we sweep the input/output bit-widths of the circuit to evaluate the performance and efficiency of our algorithms and to compare the ILP with the heuristic approach as the size of the circuit changes. We have performed our tests on a 64-bit Linux server with eight Intel Xeon CPU cores.
Our algorithms were implemented in C++ using the Standard Template Library and the GNU GLPK package [25].
The average power improvement across the examples shown in these figures is 26%, suggesting that the power savings can be quite substantial. In fact, for the CONDMULT and LCM circuits with large bit-widths and low activity factors, we save up to 45% of the power. The amount of savings tends to be correlated with bit-width: we get significant but slightly smaller power savings with lower bit-widths. This is likely because with smaller bit-widths the conditional primitives represent a larger fraction of the logic.
For the CONDMULT example, the number of conditional primitives after reconditioning is the same as before reconditioning; hence, the area overhead is 0% for all bit-widths and values of α. For the ADDMULT example, the ILP approach reduces the number of conditional primitives, so the area overhead is negative, whereas the greedy approach increases the number of conditional primitives, causing an area overhead of 8%. For the LCM, both reconditioning algorithms slightly increase the number of conditional primitives, causing an area overhead of less than 2%. To appreciate the significance of this, recall that reconditioning only changes the number of conditional cells and not the number of logic gates. In fact, the number of conditional cells will increase only if the area overhead is justified by an overall reduction in power. Thus, our results suggest that most wasted activity can be recovered with few additional conditional cells.
3 Note that, for clarity of the figures, the bit-width axis in the power and area plots is reversed.
As expected, in most cases the ILP gets better results than the greedy algorithm. On average, the greedy approach yields 80% of the power saving of the optimal solution. For the LCM example, we show two power saving plots in Fig. 18: Fig. 18(a) before the post-processing sharing of primitives described in Section VI and Fig. 18(b) after the sharing of conditional primitives. When comparing the results before sharing, the optimal ILP approach achieves better results than the greedy approach, as expected. After sharing conditional primitives, however, the greedy approach achieves larger savings. This somewhat unexpected result makes sense because our ILP approach does not consider sharing of conditional primitives and thus cannot guarantee optimal results after the post-processing sharing algorithm is applied.
We do not provide any graphs for the ALU example since the placement of the conditional primitives in the ALU was already near optimal before reconditioning. Thus, reconditioning does not move any of the conditional primitives and does not reduce power at all. This example emphasizes the fact that reconditioning is not guaranteed to always save power, and the amount of savings is a function of the structure of the circuit and in particular the optimality of the original placement of conditional cells.
Fig. 19 shows the run-time comparison of the ILP and greedy algorithms on our sample circuits. These charts show that the greedy approach is often over an order of magnitude faster than the ILP approach and that its run-time grows much more slowly than that of the ILP as a function of gate count. Our results show that the heuristic can finish in less than 10 s for circuits with tens of thousands of gates. Note that the ILP algorithm times out after 24 h for the 16- and 32-bit LCM examples, whereas the greedy algorithm still completes in seconds and still significantly improves the power.
After reconditioning, the image netlists were automatically clustered into pipeline stages, slack-matched, and translated to a QDI gate-level implementation using the Proteus flow [15]. We examined the latency and performance of the resulting circuits. Not surprisingly, for the circuits with no cycles, including ADDMULT and CONDMULT, the throughput of the resulting circuits was not reduced by reconditioning. This makes sense because the clustering and slack-matching algorithms can easily compensate for the altered location of the conditional primitives. The impact on latency was small and somewhat random. For example, the latency of the ADDMULT example increases by 2%, but only for the greedy algorithm. The latency of some bit-widths of CONDMULT did not change, while that of others decreased by 6%. The LCM example is cyclic by nature and thus can experience changes in both throughput and latency (as measured by the longest path from primary inputs to primary outputs that does not pass through token buffers). In a couple of bit-widths, we saw an increase in throughput by as much as 10%, suggesting that the movement of conditional primitives reduced the length of the critical cycles. The latency for this example increased by about 10%.
As a final case study, we implemented a QDI version of DES-X [26] using Proteus [15]. The DES-X algorithm is a variant of the data encryption standard (DES) in which a technique called key whitening is used to increase the complexity of a brute force attack. In particular, it takes 16 iterations to compute the encrypted data and adds key whitening on the first and last iterations, as suggested by the following equation:
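For reference, the standard textbook definition of DES-X with key whitening is given below. The symbol names are assumptions made for illustration (M is the plaintext block, K the DES key, and $K_1$, $K_2$ the pre- and post-whitening keys) and may differ from the notation used in the original equation and in Fig. 20.

```latex
\[
\text{DES-X}_{K, K_1, K_2}(M) = K_2 \oplus \mathrm{DES}_K\left(M \oplus K_1\right)
\]
```

The two XOR operations with $K_1$ and $K_2$ are active only once per encrypted block, whereas the DES rounds run in every iteration.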
The architecture of the implementation, highlighting the original location of the conditional communication primitives, is shown in Fig. 20. Because $\alpha = p^1_{en_1} = p^1_{en_2} = 0.0625$, both the greedy and ILP algorithms move the left RECEIVEs of k1 and M through the XOR gates, not only saving the dynamic power of the XORs but also reducing the number of RECEIVEs. The first pass of the greedy algorithm also moves the SENDs before the XORs, since the cost of the number of conditional primitives is 0. The subsequent peephole optimization removes the now back-to-back RECEIVEs and SENDs with $en_2$, saving more area and power. The second pass of the greedy algorithm is then called, but it makes no further changes. Notice that since the ILP does not model the potential benefits of the peephole optimization, it does not move the SENDs and thus saves less power. In particular, the two-step greedy approach with peephole optimizations yields 8.4% savings in total power and 5.4% savings in area, whereas the single-pass ILP approach achieves only half these savings. It may be interesting to note, however, that even after these optimizations, over 90% of the channels in this design are active every cycle.
IX. CONCLUSION
Achieving low power in asynchronous circuits does not come for free. It often requires the designer's careful attention and/or manual intervention in the conditional communication built into the circuit. This paper introduced a formal framework to automate the reconditioning problem, i.e., finding the optimal placement of conditional communication primitives to minimize power. In particular, we introduced a mathematical framework based on three-valued logic to formally define the reconditioning problem.
Moreover, we presented an ILP formulation and a simple but effective fast greedy algorithm. Furthermore, we proved that our solutions preserve equivalence and iteration-stall-freedom.
Our experimental results show significant power reduction can be achieved with low area overheads and reasonable run times.
There are several interesting areas of future work. First, extending the ILP formulation to consider the optimal sharing of conditional primitives as well as peephole optimizations would significantly improve its results. Second, a dedicated ILP solver may be able to take advantage of the problem's structure and support larger circuits than the generic solver currently used. Moreover, extending both approaches to consider performance and implement performance-constrained reconditioning is also interesting. In particular, since reconditioning changes the structure of the circuit, it can change the critical cycles that dictate performance. Because modeling these structural changes within the ILP is likely challenging, we suspect it is easier to integrate performance constraints into the greedy algorithm. Lastly, enabling a larger set of reconditioning moves, including reordering conditional cells and allowing conditional cells to pass through flip-flops, may also be beneficial, but this will add another level of verification to ensure that these moves preserve equivalence.
