We describe a procedure ( 
Introduction
In [4], a new method of logic synthesis was introduced. We refer to this method as LS_TE, which stands for Logic Synthesis preserving Toggle Equivalence. As shown in Figure 1, assume that a partitioning of a circuit N into subcircuits N i , i=1, 2,…, k is specified. The main idea of LS_TE is to optimize N by replacing each subcircuit N i with a toggle equivalent counterpart N * i . Unfortunately, [4] did not provide a specific procedure that, given a subcircuit N i , would build a toggle equivalent subcircuit N * i .
The main contribution of this paper is the introduction of such a procedure, which we refer to as the Toggle Equivalence Preserving (TEP) logic optimization procedure. In the TEP procedure, we use a non-trivial convergence scheme that makes this procedure structure-agnostic. That is, if a circuit M' toggle equivalent to an original circuit M is built by the TEP procedure, the topology of M' is not limited to that of M. This important feature of TEP is discussed in Section 3.
In the formulation of LS_TE given in [4], the circuit N to be optimized is partitioned into subcircuits N i , i=1,..,k. However, LS_TE can also be applied if, for example, subcircuits N i share internal gates. Suppose, for instance, that subcircuits N 3 . Such logic sharing can be done by the TEP procedure (slightly modified). However, a discussion of this topic is beyond the scope of this paper.
As we mentioned above, for single-output Boolean functions, toggle equivalence is the same as functional equivalence modulo negation. So, besides enabling LS_TE, the TEP procedure can be used in traditional logic synthesis. In order to compare the TEP procedure with traditional logic synthesis, the focus in this paper is on optimization of single-output functions. Even though the vast optimization flexibility of the TEP procedure cannot be invoked for single-output functions, it still has the advantage of being structure-agnostic. As a consequence, for many single-output circuits the TEP procedure found better solutions than SIS [9]. Our initial results also show that for multiple-output circuits (where the vast optimization flexibility can be exploited), the LS_TE procedure gave significant improvements over SIS.

The rest of this paper is organized as follows. Section 2 discusses related previous work, including a comparison of LS_TE and TEP with SPFDs [1] [10] [12]. In Section 3, we emphasize some important features of LS_TE and the TEP procedure. Section 4 provides definitions. Section 5 details our TEP procedure. In Section 6, we report results of our experiments. Section 7 concludes the paper, with some directions for future work on this topic.
Previous work
Multi-level logic synthesis can be performed using algebraic means such as factorization [2], kernelling [2] [11], etc. Although these techniques are fast, being algebraic, they explore only a limited portion of the optimization space. Other techniques like ODC [6] [7] and CODC [8] perform don't care based optimization, but they do not modify the structure of the circuit. (Sometimes a node gets removed as a result of don't care based optimization. However, such an occurrence is rare.) Toggle equivalence is different from the algebraic techniques, since it explores the "Boolean" options in the search space, while it differs from multi-level

are connected in the same way) but the topology of subcircuits N i and N * i can be vastly different. The "language" of SPFDs is sufficient to express the notions of toggle implication and equivalence that we use in the paper. For example, to test that circuits M and M' are toggle equivalent, one can build SPFDs of M and M' and check them for graph isomorphism. However, toggle equivalence of M and M' can be computed much more efficiently without building their SPFDs (by performing two SAT-checks). Moreover, the formulation of the TEP procedure in terms of SPFDs is hard at best. An SPFD is a relation between input assignments, while the TEP procedure operates on output assignments, and the same pair of output assignments (i.e. the same toggle) may be caused by an exponential number of pairs of input assignments. In a sense, the problem is that the definition of SPFDs was tailored to facilitate their computation from outputs to inputs, while in LS_TE and the TEP procedure computations go in the opposite direction.
3. Importance of LS_TE and the TEP procedure
In this section, we emphasize two important features of LS_TE and the TEP procedure. In Subsection 3.1, we show that LS_TE, in terms of equivalent transformations, can make moves that increase the size of intermediate circuits. This allows LS_TE to escape local minima that would trap a solution built by a traditional method of logic synthesis. In Subsection 3.2, we discuss the importance of the novel convergence scheme of the TEP procedure.

3.1 Escaping local minima in LS_TE

is not an equivalent transformation. So LS_TE can get N out of a local minimum even by making transformations of "small scope". Suppose N implements the expression x^2 < 100 as shown in Figure 2. Here subcircuit N 1 implements y=square(x) and subcircuit N 2 implements y < 100. (Let us assume that the number n of bits in x is small enough to be handled efficiently.) LS_TE can optimize N as follows.
3.2 Novel convergence scheme of the TEP procedure

As we mentioned above, the importance of the TEP procedure is due to its enabling LS_TE. However, the TEP procedure is also important in its own right. Given a single-output circuit N, the TEP procedure can build a functionally equivalent circuit N * with a completely different topology. (So it can be used in "regular" logic synthesis without any relation to LS_TE.) This property is extremely important for at least three reasons. First, N may not have any topology to reuse (e.g. if N is specified as a truth table or is represented implicitly). Second, N may contain some non-local redundancy, which makes reusing its topology unreasonable. Third, one may need to implement N using a particular library of gates (e.g. in technology mapping), and the current topology of N may not be well suited to this library.
In current synthesis methods, if the topology of N cannot be reused for some reason, a new circuit N * is obtained from a very limited space of implementations (N * may be further optimized using local transformations). For example, in SIS [9], if N is represented as a truth table, a circuit N * equivalent to N is first synthesized as a sum-of-products (which is a very limited class of circuits). Then a multi-level circuit is obtained from N * by local transformations. (Another approach would be to build a circuit N * of multiplexers (i.e. build a BDD [3]) equivalent to N and then optimize it using some local transformations. BDDs are another example of a restricted class of circuits.)
The reason why current methods have to restrict the class of implementations considered when changing the topology of N is the "convergence problem". Suppose we build a circuit N * that does not use the topology of N. Then we have to make sure that the network of gates being built "converges" to a circuit equivalent to N. The TEP procedure solves this problem by introducing a very simple and general convergence scheme. Namely, it builds a sequence of circuits N 1 , N 2 ,… such that a) N i+1 toggles strictly less than N i and b) every circuit of this sequence toggles at least as much as the original circuit N. Here N 1 is an "empty circuit" consisting only of the inputs of N. In other words, the TEP procedure builds a sequence of circuits that monotonically lose toggles until a circuit N m toggle equivalent to N is built. The TEP procedure also restricts the class of implementations it considers, since it requires that only primary outputs of N i are allowed to feed the gates of N i+1 that are not in N i . However, this is a mild restriction in comparison to the ones used by existing methods. So, the TEP procedure can select an optimized implementation from a very general class of multi-level circuits.
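Conditions a) and b) of the convergence scheme can be checked mechanically on small examples. The following Python sketch is only an illustration under stated assumptions: it reads "a circuit toggles on a pair of input assignments" as "its output vectors differ on that pair", it models each circuit as a hypothetical Python function from an input tuple to an output tuple, and it enumerates all input pairs by brute force instead of using SAT. The names toggle_set and is_converging_sequence are ours, not part of the TEP procedure.

```python
from itertools import combinations, product

def toggle_set(circuit, n):
    """Unordered pairs of n-bit inputs on which the circuit's outputs differ."""
    inputs = list(product((0, 1), repeat=n))
    return {(a, b) for a, b in combinations(inputs, 2) if circuit(a) != circuit(b)}

def is_converging_sequence(original, sequence, n):
    """Check conditions a) and b) for a candidate sequence of circuits."""
    required = toggle_set(original, n)
    sets = [toggle_set(c, n) for c in sequence]
    # a) each circuit toggles strictly less than its predecessor
    loses_toggles = all(later < earlier for earlier, later in zip(sets, sets[1:]))
    # b) every circuit toggles at least as much as the original circuit
    keeps_required = all(required <= s for s in sets)
    return loses_toggles and keeps_required

# Original circuit: a 2-input AND gate (single output).
original = lambda x: (x[0] & x[1],)
# "Empty circuit": its outputs are just the primary inputs.
empty = lambda x: x
# Final circuit: NAND, i.e. the original modulo negation.
final = lambda x: (1 - (x[0] & x[1]),)

print(is_converging_sequence(original, [empty, final], 2))
print(toggle_set(final, 2) == toggle_set(original, 2))
```

Both checks print True for this toy sequence: the empty circuit toggles on all six input pairs, while the final circuit toggles on exactly the three pairs on which the AND gate toggles.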
Preliminaries and terminology
In this section, we recall the notion of toggle equivalence and its properties. All the propositions given in this paper are either proven in [4] , or can be easily derived from them.
Toggle equivalence of Boolean functions

Definition 1. Let f: {0,1}^n → {0,1}^m be an m-output

Definition 2. Let f 1 and f 2 be an m-output and a k-output Boolean function, respectively, with the same set of variables. Functions f 1 and f 2 are called toggle
Implication of toggling
In this subsection, we introduce the notion of implication of toggling and describe how toggle equivalence and implication of toggling can be tested.

Definition 4. Let f 1 : {0,1}^n → {0,1}^m and f 2 : {0,1}^n → {0,1}^k be an m-output and a k-output Boolean function, respectively, with the same set of input variables. Toggling of f 1 implies toggling of f 2 iff for any pair of input variable assignments x' and x'',

Definition 5. Let f 1 and f 2 be multi-output Boolean functions. Toggling of f 1 strictly implies toggling of f 2 if toggling of f 1 implies toggling of f 2 and there is a pair of assignments x' and x'' such that

We will denote by f 1 ≤ f 2 (respectively f 1 < f 2 ) the fact that toggling of function f 1 implies toggling of (respectively strictly implies toggling of) f 2 . Let circuits N 1 and N 2 implement functions f 1 and f 2 respectively. We will denote by N 1 ≤ N 2 (respectively N 1 < N 2 ) the fact that
Testing for Implication of Toggling.
Let N 1 and N 2 be two Boolean circuits to be checked for implication of toggling. (N 1 , N 2 ) is unsatisfiable, where S (N 1 , N 2 
Based on this, we can make the following three comments.

1) To test if N 1 ≤ N 2 , we simply test the satisfiability of S(N 1 , N 2 ). If it is unsatisfiable (i.e. a constant zero), we conclude that N 1 ≤ N 2 .

2) If S(N 1 , N 2 ) is satisfiable, it means that there exists a pair of input vectors x and x * for which circuit N 1 toggles, while N 2 does not.

3) Let S(N 1 , N 2 ) be satisfiable. If we removed all toggles from N 1 that "are not in" N 2 , we would have N 1 ≤ N 2 . In other words, given two circuits N 1 and N 2 , we can define a function find_toggle_setdifference(N 1 , N 2 ) = ALLSAT(S(N 1 , N 2 )), which returns the toggles of N 1 that are not matched by toggles of N 2 . This is the set of toggles that must be removed from N 1 . If the resulting set ALLSAT(S(N 1 , N 2 )) is too large, a manageable subset of it can be used.
From Proposition 4, it follows that checking for toggle equivalence reduces to two satisfiability checks (henceforth called SAT checks).
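The tests described above can be mimicked on small functions. The sketch below is an illustration under explicit assumptions, not the SAT-based algorithm of this paper: it replaces the SAT check on S(N 1 , N 2 ) by exhaustive enumeration of input pairs, and it reads "N 1 toggles" as "the output vectors of N 1 differ on the pair". The function names unmatched_toggles, implies_toggling and toggle_equivalent are hypothetical.

```python
from itertools import combinations, product

def unmatched_toggles(f1, f2, n):
    """Input pairs on which f1 toggles while f2 does not.

    Brute-force stand-in for enumerating the satisfying assignments of S.
    """
    inputs = list(product((0, 1), repeat=n))
    return [(a, b) for a, b in combinations(inputs, 2)
            if f1(a) != f1(b) and f2(a) == f2(b)]

def implies_toggling(f1, f2, n):
    """True iff toggling of f1 implies toggling of f2 (no unmatched toggle exists)."""
    return not unmatched_toggles(f1, f2, n)

def toggle_equivalent(f1, f2, n):
    """Two implication checks, one in each direction."""
    return implies_toggling(f1, f2, n) and implies_toggling(f2, f1, n)

identity = lambda x: (x[0], x[1])
recoded = lambda x: (x[0] ^ x[1], x[1])    # a re-encoding of the same two outputs
and_gate = lambda x: (x[0] & x[1],)
nand_gate = lambda x: (1 - (x[0] & x[1]),)

print(toggle_equivalent(identity, recoded, 2))      # re-encoded outputs
print(toggle_equivalent(and_gate, nand_gate, 2))    # equivalence modulo negation
print(implies_toggling(and_gate, identity, 2))
print(implies_toggling(identity, and_gate, 2))
```

In this toy setting the first three checks succeed and the last one fails, because the two-output identity function toggles on pairs of inputs that the AND gate does not distinguish.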
Correlation function
In this section, we briefly introduce the notion of correlation function, to extend definitions of toggle implication and toggle equivalence to the case when functions f 1 is true that 
TEP procedure
The TEP procedure produces the circuit N 2 (given a combinational circuit N 1 ) in a topological manner from inputs to outputs. These operations are illustrated in Figure 3. In this way, the sequence of cuts C i that is produced is topologically ordered. This means that for a pair of cuts C i and C p such that i < p, no path from a primary input to a primary output of N 2 can traverse C p before C i , although C i and C p may have common nodes. Then, if a node in C p toggles for a given pair of input vectors, there must be at least one node in C i that toggles as well. So just from the fact that C i and C p are topologically ordered it follows that current is generated, such that it has at least one fewer toggle than the previous N 2 current . This operation is performed by the function discard_toggles, which is described in the next subsection. Finally, redundant outputs of N 2 current are removed in the function remove_redundant_outputs. An output of N 2 current is redundant if, after its removal from N 2 current , the condition N 1 ≤ N 2 current still holds. Note that for each test for implication of toggling (i.e., each "≤" check), we utilize the SAT-based algorithm described in subsection 4.3.
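The redundancy criterion for outputs can be illustrated on a toy example. The sketch below is an assumption-based illustration rather than the implementation used in this work: each output is modeled as a hypothetical Python function of the input tuple, the "≤" check is done by brute-force enumeration instead of SAT, and outputs are tried for removal greedily in list order (the order actually used by remove_redundant_outputs is not specified here).

```python
from itertools import combinations, product

def toggling_pairs(outputs, n):
    """Input pairs on which the vector of the given output functions differs."""
    inputs = list(product((0, 1), repeat=n))
    value = lambda x: tuple(g(x) for g in outputs)
    return {(a, b) for a, b in combinations(inputs, 2) if value(a) != value(b)}

def remove_redundant_outputs(target_outputs, current_outputs, n):
    """Greedily drop outputs while the target's toggles remain covered."""
    required = toggling_pairs(target_outputs, n)
    kept = list(current_outputs)
    for g in list(kept):
        trial = [h for h in kept if h is not g]
        # g is redundant if the target still implies toggling of the trial circuit
        if required <= toggling_pairs(trial, n):
            kept = trial
    return kept

x0 = lambda x: x[0]
x1 = lambda x: x[1]
conj = lambda x: x[0] & x[1]

target = [conj]              # plays the role of N 1
current = [x0, x1, conj]     # plays the role of N 2 current
print(remove_redundant_outputs(target, current, 2) == [conj])
```

Here the outputs x0 and x1 are dropped in turn because the remaining outputs still toggle whenever the target does, while the last output cannot be removed.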
5.1 Discard toggles from N 2 i

Figure 5 describes the pseudocode of the discard_toggles procedure used by the TEP procedure (Figure 4). The procedure discard_toggles consists of two parts. The procedures remove_toggles and add_toggles are explained in detail in the following subsections. In both these procedures, toggle removal and addition are done with AND gates, with their inputs appropriately complemented. = (y, y') .
To remove the toggle r, we add an AND gate G. We consider two cases.

Case i): If |Y 1 | = 1, then gate G has two inputs. One of these inputs is specified by the variable of Y 1 , and the other input is chosen from Y 2 . All possible polarities of the second input are considered as well. The configuration for which G(y) = G(y') = 0 and that removes the largest number of toggles of R * is selected.

Case ii): If |Y 1 | > 1, then gate G has |Y 1 | inputs. These inputs are connected to the variables in Y 1 , with appropriate polarity selection to guarantee that G(y) = G(y') = 0.
In both cases, the construction of gate G guarantees that G(y)=G(y')=0. After adding the gate G, we form the cut C new by removing from C current all the nodes in Y 1 and adding the output of G. Then, the toggle r =(y, y') is removed from the nodes of C new . The circuit resulting from this operation is called N 2 temp .
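The polarity requirement G(y) = G(y') = 0 can be made concrete with a small enumeration. The following Python sketch is illustrative only: it represents the assignments y and y' as hypothetical dictionaries from cut-node names to bits, enumerates all input polarities of an AND gate over a given list of variables, and keeps those for which the gate evaluates to 0 on both assignments. It does not model how Y 1 and Y 2 are chosen or how removed toggles of R * are counted.

```python
from itertools import product

def and_gate(polarity, assignment):
    """AND of literals; polarity[v] is True when v is used uncomplemented."""
    return int(all(assignment[v] == int(positive) for v, positive in polarity.items()))

def zeroing_polarities(variables, y, y_prime):
    """All polarity choices for which the AND gate is 0 on both y and y_prime."""
    result = []
    for bits in product((True, False), repeat=len(variables)):
        polarity = dict(zip(variables, bits))
        if and_gate(polarity, y) == 0 and and_gate(polarity, y_prime) == 0:
            result.append(polarity)
    return result

# Two cut nodes whose values both differ between y and y'.
y = {"a": 0, "b": 1}
y_prime = {"a": 1, "b": 0}
for polarity in zeroing_polarities(["a", "b"], y, y_prime):
    print(polarity)
```

For this pair of assignments exactly two of the four polarity choices qualify: both inputs uncomplemented, or both inputs complemented.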
5.1.2 Procedure add_toggles

Unfortunately, adding the gate G (described in the previous subsection) may sometimes remove certain toggles that are required in N 1 . As a consequence, we have to perform a "clean-up" step, and add these toggles back into the design.
We begin by computing D, the set of toggles that need to be added. D is computed by find_toggle_setdifference(N 1 , N 2 temp ). The objective is to add the minimum number of AND gates that reintroduce all toggles from D, and at the same time to minimize the number of toggles from R that get re-introduced. It is not hard to prove that one can always reintroduce a toggle from the set D, by using a 2-input AND gate H with appropriately selected inputs and input polarities, without re-introducing a toggle from the set R. The proof is omitted due to space constraints.
Once again, we have two cases to consider, analogous to those in the previous subsection.

Case i): When the gate G added by remove_toggles was a 2-input gate, with |Y 1 | = 1, one of the inputs of H is the same as the node in Y 1 . The other input of H is selected from among the nodes in Y 2 . All possible nodes and polarities are explored to maximize the weighted cost function n 1 +p*n 2 . Here, n 1 is the number of toggles of R prevented from being re-introduced, n 2 is the number of toggles of D re-introduced, and p is the weight parameter (which was set to 1 in our experiments). We add only those gates for which n 1 is 1 or more.

Case ii): If |Y 1 | > 1, then the first input of H is selected from Y 1 , and the second from Y, excluding the input already chosen as the first leg. The cost function used to select inputs and their polarities is identical to the one explained in Case i above.
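The selection rule based on the cost function n 1 +p*n 2 can be sketched as follows. This is a minimal illustration that assumes the counts n 1 and n 2 have already been computed for each candidate gate; the tuple layout (name, n1, n2) and the function name pick_gate are hypothetical, and ties are broken arbitrarily in favor of the first candidate.

```python
def pick_gate(candidates, p=1):
    """Pick the candidate maximizing n1 + p*n2 among those with n1 >= 1.

    Each candidate is a tuple (name, n1, n2), where n1 counts toggles of R
    kept out and n2 counts toggles of D re-introduced.
    Returns None when no candidate has n1 >= 1.
    """
    eligible = [c for c in candidates if c[1] >= 1]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c[1] + p * c[2])

candidates = [("h1", 0, 5), ("h2", 1, 2), ("h3", 2, 0)]
print(pick_gate(candidates))        # weight p = 1, as in the experiments
print(pick_gate(candidates, p=0))   # ignoring re-introduced toggles of D
```

With p = 1 the sketch selects h2 (cost 3), whereas h1 is never considered because its n 1 is 0 despite its large n 2 .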
After each AND gate is added to the circuit, the set D is recomputed. The routine add_toggles continues to add AND gates until the set D reduces to ∅. At this point the resulting circuit N 2 new_current is returned. It satisfies the property that N 1 ≤ N 2 new_current < N 2 current . Note that for a single gate added in remove_toggles, zero, one or more AND gates could be added in the following call of add_toggles.
Experimental results
Our preliminary implementation of the TEP procedure is in SIS [9]. We performed various experiments to compare TEP with traditional logic synthesis commands. The experiments were performed on a 3 GHz Xeon CPU with 2 GB of memory.

Table 1 provides the results of applying the TEP procedure and SIS for optimizing circuits implementing the expressions x^2 < C and C 1 *x < C 2 for different word sizes. (In contrast to the example of Subsection 3.1, the expressions above were optimized as one circuit, i.e. by one call of the TEP procedure.) In all experiments, the value of C was chosen to be 200 (the results do not change much if one varies C). C 1 and C 2 were set to decimal value 11111. The two expressions above can be reduced to the much simpler expressions x < C' and x < C'' respectively, where C' is equal to sqrt(C) and C'' is equal to C 2 /C 1 . The objective of this experiment was to show that since TEP is structure-agnostic it can be used to simplify "non-local" redundancy. Note that although optimization of these expressions can easily be done manually, one can give examples of non-local redundancies that are much harder to find manually or by a program. Any logic synthesis procedure that changes the original circuit's structure locally (like SPFDs or don't care based optimizations) can easily get trapped in a local minimum. Note that only for smaller values of C, C 1 and C 2 is it possible to build ROBDDs. For the experiments in Table 1, we set the threshold of R * at 10 as explained in sections 5.1 and 4.3, i.e. R * contained only 10 (out of a huge number of) toggles to be removed. The reason why the TEP procedure worked so well with such a small subset R * was that by adding an AND gate to remove a toggle of R * explicitly, we may implicitly remove a huge number of toggles that were "skipped" in R * .
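The reduction of the two benchmark expressions to single comparisons can be verified numerically. The sketch below is an illustration under our own assumptions: x is treated as an unsigned integer of the given word size, products are computed without overflow, and the thresholds are rounded to integers (so the threshold for x^2 < C is the integer square root of C-1 plus one, and the threshold for C 1 *x < C 2 is the ceiling of C 2 /C 1 ). The function names are hypothetical.

```python
from math import isqrt

def square_threshold(c):
    """Smallest t with: for integers x >= 0, x*x < c  <=>  x < t (requires c >= 1)."""
    return isqrt(c - 1) + 1

def linear_threshold(c1, c2):
    """Smallest t with: for integers x >= 0, c1*x < c2  <=>  x < t (requires c1, c2 >= 1)."""
    return -(-c2 // c1)  # ceiling of c2 / c1 using integer arithmetic

def check(word_size, c=200, c1=11111, c2=11111):
    """Exhaustively compare the original and reduced expressions for one word size."""
    t_sq = square_threshold(c)
    t_lin = linear_threshold(c1, c2)
    return all((x * x < c) == (x < t_sq) and (c1 * x < c2) == (x < t_lin)
               for x in range(2 ** word_size))

print(square_threshold(200), linear_threshold(11111, 11111), check(8))
```

For the constants used in the experiments this gives the thresholds 15 and 1, so under these assumptions x^2 < 200 reduces to x < 15 and 11111*x < 11111 reduces to x < 1.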
The first column in Table 1 represents the expression being simplified, while the second column represents the word size. Columns 3 and 4 represent the runtime and number of gates returned by script.rugged. Columns 5 and 6 represent the runtime and number of gates returned by collapse followed by script.rugged. The corresponding results for TEP are provided in Columns 7 and 8, while Column 9 represents the time taken to build a ROBDD (using the nanotrav package in CUDD). The notation "Mem" indicates a memory out condition. In all cases, the number of gates refers to the number of gates required after optimization and decomposition using AND2 and inverter gates.
We observe that script.rugged requires significantly more gates than TEP. This is because script.rugged performs only local changes of the circuit, and so SIS gets stuck in a local minimum. TEP, on the other hand, uses only the functionality of the circuit and so produces a dramatically smaller circuit. We may run collapse before script.rugged, to allow SIS to re-structure the logic better. However, for all but the smallest word widths, collapse fails. Similarly, the ROBDD computation fails for large word widths, while TEP optimizes these circuits with fewer than 66 gates. Interestingly, the arithmetic expressions we used turned out to have "local redundancies" (however, in general, global redundancy of a circuit does not "translate" into local redundancies). So redundancy removal in SIS [9] can optimize them with comparable results, but takes about two orders of magnitude more time than the TEP procedure.

Table 2 shows the results of running a commercial tool (CT) on circuits produced by script.rugged and the TEP procedure. We used single-output circuits extracted from MCNC benchmarks. The objective of the experiment was to show that even for very small circuits, TEP can achieve better optimization. (The other reason for targeting small subcircuits is that in LS_TE, the TEP procedure is used for optimizing subcircuits N i of circuit N that are assumed to be small.) The first column of Table 2 shows the names of the circuits and the output number (in parentheses). The second column provides the number of inputs in the single-output circuits. Columns 3 and 4 provide the mapped area and delay for the output of script.rugged mapped by CT, while Columns 5 and 6 provide these numbers for the TEP output mapped by CT. The standard cell library had 38 gates, implemented in a 0.18 µm process. The licensing agreement for CT requires us not to identify its name.
The results of Table 2 indicate that TEP based circuits, after mapping, result in a 12.5% area improvement, and a 1.6% delay penalty over circuits optimized with script.rugged before mapping with CT. The TEP results improve on the script.rugged results for 85% of the examples in terms of area, and for 45% of the examples in terms of delay.
The objective of the experiment summarized in Table 3 was to provide a brief demonstration of the ability of LS_TE. The LS_TE method was used to optimize two-stage circuits. Both stages correspond to standard benchmark circuits, with the second stage being a single-output circuit. The outputs of the first stage are inputs to the second. MCNC benchmarks rd84 and squar5 were used as the first stage circuits. The second stage circuits are single-output circuits extracted from MCNC benchmarks (second column). Columns 3 through 6 give the number of gates in optimized circuits and runtimes for optimization by script.rugged and LS_TE.

form a circuit N * functionally equivalent to N modulo negation. Since we assume that N 1 and N 2 were designed independently, any output encoding for N 1 is in a sense as good as the original one. So the heuristics of TEP that aim at finding a toggle equivalent counterpart of N 1 that is as small as possible make sense.
Note that the number of gates resulting from TEP optimization is significantly smaller than for SIS. In fact, on average, TEP requires 50.5% fewer gates than script.rugged.
Our current TEP implementation is unoptimized, and we have efforts underway to improve the runtimes of TEP. 
Conclusions
We have presented a new toggle equivalence preservation based procedure (TEP) for logic synthesis. This TEP procedure can be used in the scenario shown in Figure 1. The idea is to resynthesize a circuit N (consisting of subcircuits N i ) in such a manner that the high-level partitioning structure of N is retained. Each subcircuit N i is re-synthesized into a design N * i using the TEP procedure. This resynthesis explores a huge optimization flexibility, since the outputs of N i are re-encoded by TEP. The TEP procedure was formulated for multi-output circuits. It is structure-agnostic, unlike existing logic optimization procedures. Also, it is able to explore all possible output encodings efficiently during synthesis. For single-output circuits, toggle equivalence is the same as functional equivalence modulo negation. Therefore, we tested TEP on single-output circuits to enable a fair comparison with existing synthesis approaches, although the full power of TEP is exhibited for multi-output circuits. The preliminary implementation of TEP is done in SIS, using a SAT-based computation. First results show encouraging improvements over SIS. When the full power of TEP is utilized (for multi-output circuits), we expect yet further improvements.
