Restructuring techniques for And-Inverter Graphs (AIGs), such as rewriting and refactoring, are powerful, scalable and fast, achieving highly optimized AIGs after a few iterations. However, these techniques are biased by the original AIG structure and limited to single-output optimizations. This paper investigates AIG optimization for area, exploring how far Boolean methods can reduce AIG nodes through local optimization. Boolean division is applied to multi-output functions using two-literal divisors, and Boolean decomposition is introduced as a method for AIG optimization. Multi-output blocks are extracted from the AIG and optimized, achieving a further AIG node reduction of 7.76% on average for ITC99 and MCNC benchmarks.
INTRODUCTION
Logic synthesis is a key process in design automation, generating an optimized netlist of logic gates from an RTL representation. It is often divided into two classes: technology independent and technology dependent [6]. Recently, technology-independent algorithms using AIGs have been proposed, enabling efficient and scalable optimizations [10].
Restructuring methods such as refactoring [10], rewriting [13], and balancing [5] are powerful, obtaining highly optimized AIGs after a few iterations. Still, these techniques are usually constrained to single-output transformations, and iterations with technology mapping [7, 12] are often used to improve structurally biased results.
AIG rewriting and refactoring perform local transformations, extracting the local context with K-cuts [16] , windows or maximum fanout-free cones (MFFCs). K-cuts can be considered a superior method to extract local context compared to windowing [7, 10] , as it is possible to control the support of the Boolean functions to be optimized, while identifying a region of the circuit that depends on this support.
GLSVLSI '17, May 10-12, 2017, Banff, AB, Canada.
Algorithms based on K-cut enumeration have been proposed, such as factor cuts [4] and priority cuts [14], reducing the search space and enabling cuts with more nodes and inputs. Also, multi-output blocks based on K-cuts were presented [8, 9], which extract the complete local context.
This paper studies technology-independent transformations that reduce the AIG size by exploring the use of Boolean decomposition. This is done by extending the idea of two-literal divisors [15] to multi-output functions (see Fig. 1). The principle is as follows. A multi-output function (y1, . . . , ym) = F(x1, . . . , xn) can be decomposed into another multi-output function (y1, . . . , ym) = G(x1, . . . , xn, z), with z being a two-literal divisor (z = xi ⊗ xj) and ⊗ being a Boolean operator (such as an AIG node). G, which can be obtained by Boolean division, is expected to be a simpler function than F. A multi-output Boolean function can be recursively decomposed using this paradigm, and the result can be represented as an AIG network.
The main purpose of this paper is to explore how far Boolean decomposition can go beyond the existing AIG rewriting methods. Unfortunately, scalability is an important issue when dealing with Boolean methods. Obviously, collapsing large networks into one-node functions and then decomposing is not computationally affordable.
This paper extends [15] in the following ways:
• Boolean decomposition with two-literal divisors is generalized to be applicable to netlists with multiple outputs, instead of individual single-output functions.
• The selection of divisors is customized to increase the logic shared among multiple outputs.
• A set of filters to reduce the search space is presented.
• Scalability is addressed by iteratively applying Boolean decomposition to KL-cuts [9] of the AIG (see Fig. 2 ). Note that a KL-cut represents a portion of the circuit that depends on the same set of variables. The key idea is to use this property of KL-cuts to identify divisors that are useful for several functions, sharing more logic and reducing area.
In [11] , a resynthesis method that uses Boolean decomposition is performed on netlists of FPGA LUTs, by identifying and decomposing MFFCs. However, [11] does not perform a complete Boolean decomposition, as it is limited to a simplified version of Disjoint Support Decomposition. In [17] , windows are enumerated and don't cares identified in the network are used to simplify these windows, while in [18] the algebraic decomposition method is improved by assigning previously calculated don't cares using a set of rules. In this work, there is neither search nor previous calculation of don't care conditions, as the optimization using don't cares occurs solely inside the KL-cut logic.
The results reported in the paper have often been obtained by applying computationally intensive methods (e.g., many divisors, many cuts). Bear in mind that the goal here is to establish the bounds potentially reachable by future work that could use smart oracles to drive the search. Some preliminary criteria are discussed in the paper. Still, the proposed method can be of interest for highly repetitive structures or area-critical components.
MULTI-OUTPUT DECOMPOSITION
Boolean decomposition is known to obtain good-quality results at the expense of a high computational cost. Finding good divisors is the most challenging task. Different approaches can be proposed to prune the search, e.g., use two-literal divisors [15] , consider polarity information to ignore unpromising divisions, or reuse algebraic factored forms to select divisors. Considering all these simplifications, is it still possible to obtain good results?
To illustrate the method, consider the set of functions and its DC-set in (1) . Any combination of two literals of F can be selected as a Boolean divisor. The number of literals is defined as the cost function, therefore the cost of F is 12.
A reduction of 3 literals is achieved by performing a multi-output Boolean division using the divisor y = a·b, generating the new set of functions and DC-set in (2). Note that this divisor is not easily extracted from the set of functions F: f1 does not have the variable a and f2 does not have the variable b. Still, every function is reduced by 1 literal due to the effective use of the DC-set.
For further decomposition steps, it is possible to perform a DC-set projection, removing from the DC-set the variables that are not in the support of F and thereby decreasing the computational effort of the Boolean division. By projecting DCnew onto the support of Fnew, DCproj = b̄ · y is generated.
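As a small illustration of the projection step, the sketch below takes only the satisfiability don't care of the divisor y = a·b as the DC-set and removes the variable a from it. It assumes that projection means keeping a point as don't care only when it is a don't care for every value of the removed variable; the paper does not spell out the operator, and the function names are hypothetical.

```python
from itertools import product

def sdc(a, b, y):
    """SDC of the divisor y = a AND b: true where y disagrees with a AND b."""
    return y != (a and b)

def project_out_a(b, y):
    """Assumed projection: (b, y) stays a don't care only if it is one for all a."""
    return all(sdc(a, b, y) for a in (False, True))

# Enumerate the projected DC-set over the remaining variables (b, y).
projected = {(b, y) for b, y in product((False, True), repeat=2)
             if project_out_a(b, y)}

# The only surviving point is b = 0, y = 1, i.e. the cube (NOT b) AND y.
expected = {(b, y) for b, y in product((False, True), repeat=2)
            if (not b) and y}
print(projected == expected)
```

Under this assumption the projected set is the single cube with b = 0 and y = 1, which matches the intuition that y cannot be 1 when b is 0.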
AIG optimization example
In order to demonstrate the potential benefits of the approach introduced in this paper, the circuit b06 of the ITC99 benchmark suite [1] is used. The optimization flow and the results for the different solutions are depicted in Fig. 3. The input circuit is a Boolean network represented in a BLIF file. An AIG with 42 nodes is obtained after decomposing the Boolean network with algebraic factorization and structural hashing (strash command in ABC [3]). After iteratively applying algebraic transformations using the dc2 command in ABC, the number of nodes is reduced to 35.
Figure 3: Optimization flow using different methods for b06.
An alternative approach would start by collapsing the initial network, which results in a Boolean network with one node for each output. Decomposing these nodes results in a larger AIG with 47 nodes, but applying iterative algebraic transformations reduces it to 31 nodes.
The method proposed in this paper applies Boolean decomposition on top of the collapsed network using an approach inspired by [15]. In this work, multi-output decomposition is performed by iteratively selecting the Boolean divisor that minimizes the number of literals in the factored forms of the functions [2]. This approach achieves better logic sharing, obtaining an AIG with only 24 nodes.
Also, a set of filters is applied to reduce the search space of divisors. By reducing the number of two-level minimizations compared to [15], runtime is reduced by a factor of 40 on average, without sacrificing the quality of the results. Note that the results in this work cannot be directly compared with [15], which only presents the decomposition of single-output functions and does not apply it to circuit optimization.
BACKGROUND
Functions, unateness and containment
A completely specified Boolean function f is a mapping from an n-dimensional (n ≥ 0) Boolean space into a 1-dimensional one: {0, 1}^n → {0, 1}. The positive (negative) cofactor f|x=1 (f|x=0) of f with respect to the variable x is the function obtained by assigning x to one (zero) in f.
A Boolean function f is positive (negative) unate in the variable x if f |x=1 ⊇ f |x=0 (f |x=0 ⊇ f |x=1), where ⊇ is the set operation for inclusion. Otherwise, f is considered binate in x. This is the concept of unateness [6] , intended for completely specified functions.
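The unateness test above can be checked by brute force on small functions. The following is a minimal sketch that represents each cofactor by its set of satisfying assignments over the remaining variables; the helper names `cofactor` and `unateness` and the example functions are illustrative choices, not taken from the paper.

```python
from itertools import product

def cofactor(f, n, i, value):
    """On-set of f with variable i fixed to value, over the other n-1 variables."""
    on = set()
    for rest in product((0, 1), repeat=n - 1):
        point = rest[:i] + (value,) + rest[i:]
        if f(*point):
            on.add(rest)
    return on

def unateness(f, n, i):
    """Return (positive_unate, negative_unate); both False means binate."""
    pos, neg = cofactor(f, n, i, 1), cofactor(f, n, i, 0)
    return pos >= neg, neg >= pos  # >= on sets is the superset test

f = lambda a, b, c: (a and b) or (not c)   # a*b + c'
g = lambda a, b: a != b                    # a XOR b

print(unateness(f, 3, 0))  # variable a
print(unateness(f, 3, 2))  # variable c
print(unateness(g, 2, 0))  # variable a of the XOR
```

For f = a·b + c̄ the check reports a as positive unate and c as negative unate, while both flags are false for the XOR, which is binate in each variable.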
An incompletely specified function (ISF) g is a mapping from an n-dimensional (n ≥ 0) Boolean space into a 1-dimensional space extended with a don't care value: {0, 1}^n → {0, 1, *}, where * denotes a don't care. The subdomains of g that evaluate to 1, 0 and * are the ON-set, OFF-set and DC-set, and can be represented by the completely specified functions g_on, g_off and g_dc.
Containment [19] is a generalization of the concept of unateness for ISFs. The variable x in the positive polarity is contained in g if (g_dc|x=1 ∪ g_on|x=1) ⊇ g_on|x=0, and in the negative polarity if (g_dc|x=0 ∪ g_on|x=0) ⊇ g_on|x=1, where the operators ∪ and ⊇ are the set operations of union and inclusion, respectively.
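The containment condition can be evaluated the same way as unateness, with the DC-set added to one side of the inclusion. The sketch below is a brute-force check on a hypothetical two-variable ISF whose ON-set is a XOR b and whose DC-set is a·b; the helper names are illustrative.

```python
from itertools import product

def cofactor(f, n, i, value):
    """Satisfying assignments of f with variable i fixed, over the other variables."""
    on = set()
    for rest in product((0, 1), repeat=n - 1):
        if f(*(rest[:i] + (value,) + rest[i:])):
            on.add(rest)
    return on

def contained(g_on, g_dc, n, i, positive=True):
    """Containment of variable i in the given polarity, following the definition."""
    v = 1 if positive else 0
    flexible = cofactor(g_dc, n, i, v) | cofactor(g_on, n, i, v)
    return flexible >= cofactor(g_on, n, i, 1 - v)

g_on = lambda a, b: a != b         # ON-set: a XOR b (binate on its own)
g_dc = lambda a, b: a and b        # DC-set: a AND b

print(contained(g_on, g_dc, 2, 0, positive=True))
print(contained(g_on, g_dc, 2, 0, positive=False))
```

In this example a is contained only in the positive polarity: the don't care at a = b = 1 lets the ISF be implemented as a + b, even though the ON-set alone is binate in a.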
And-Inverter Graphs and K-cuts
An And-Inverter Graph is a directed acyclic graph where each node has either 0 incoming edges (the primary inputs, PI) or 2 incoming edges (the AND nodes). Each edge can be negated or not. Some nodes are also primary outputs (PO). Sequential elements are considered as PI/PO pairs. An AIG example is depicted in Fig. 4. The dotted lines are negated edges, the circles are AND nodes, the rectangles at the bottom are PIs and those at the top are POs.
A cut of a node n in a graph G is a set of nodes c such that every path between a PI and n contains a node in c. A cut is said to be irredundant if no subset of it is also a cut. A K-cut [16] of a graph G is an irredundant cut of K or fewer nodes. Consider the two sets of cuts A and B and the auxiliary set operation described in (3) .
This set operation removes the redundant cuts, and it is commutative, as the union set operation ∪ is also commutative. Let ΦK(n) be the set of K-cuts of n ∈ G and, if n is an AND node, let n1 and n2 be its inputs. Then, ΦK(n) is defined recursively [4], as described in (4).
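A common way to realize this recursion is sketched below: the cuts of an AND node are the trivial cut plus the pairwise unions of the cuts of its two inputs, keeping unions with at most K nodes and dropping any cut that strictly contains another. This is the standard bottom-up scheme and is assumed, not confirmed, to match (3) and (4); the dictionary encoding of the AIG and the function names are hypothetical, and edge negations are ignored since they do not affect cuts.

```python
def merge(A, B, k):
    """Pairwise unions of cuts from A and B with at most k nodes, minus dominated cuts."""
    unions = {u | v for u in A for v in B if len(u | v) <= k}
    return {c for c in unions if not any(other < c for other in unions)}

def enumerate_cuts(ands, k):
    """ands maps each AND node to its two fanins; any other node is a primary input."""
    cuts = {}

    def visit(n):
        if n not in cuts:
            trivial = {frozenset([n])}
            if n in ands:
                n1, n2 = ands[n]
                cuts[n] = trivial | merge(visit(n1), visit(n2), k)
            else:
                cuts[n] = trivial
        return cuts[n]

    for n in ands:
        visit(n)
    return cuts

# Tiny example: n1 = a AND b, n2 = b AND c, n3 = n1 AND n2.
aig = {"n1": ("a", "b"), "n2": ("b", "c"), "n3": ("n1", "n2")}
cuts3 = enumerate_cuts(aig, 3)["n3"]
cuts2 = enumerate_cuts(aig, 2)["n3"]
print(sorted(sorted(c) for c in cuts3))
```

With K = 3 the root n3 has five cuts, including {a, b, c}; with K = 2 only the trivial cut and {n1, n2} remain.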
KL-cuts
K-cuts can be an efficient way to represent a graph region regarding a single output. However, several K-cuts may be necessary to cover regions with multiple outputs, duplicating logic. A KL-cut [8, 9] identifies a multi-output region in order to overcome this issue. A KL-cut is a sub-graph GKL of a graph G with K inputs and L outputs. It is represented as two sets of nodes: the inputs GK and the outputs GL.
If a node n belongs to a path between nK ∈ GK and nL ∈ GL, with n ∉ GK, then n is contained in GKL. Notice that all nodes in GL are contained in GKL, and GKL does not contain any node of GK. In this work, the number of outputs is not restricted in KL-cut enumeration. Therefore, for every K-cut of a node n, there is a unique KL-cut GKL.
The nodes that are part of GKL are identified by traversing the graph G forward from GK. A node n is part of GKL if at least one of the K-cuts of n is a subset of or equal to GK. A node of GKL is contained in GL if it is also a PO, or if it has a fanout to a node not contained in GKL.
KL-cut enumeration example: Figure 4 depicts GK = {3, 4, 5, 6, 10, 13} with its nodes in light gray, which is one of the K-cuts at node 31, in dark gray. The KL-cut GKL is obtained by traversing the AIG forward from GK, identifying the sub-graph hatched in Fig. 4. Nodes 31 and 40 are also POs, and nodes 12 and 33 have fanout to nodes not contained in GKL, therefore GL = {12, 31, 33, 40} is defined. Note that the logic of the GL nodes can be described as Boolean functions that depend on the same support: the GK nodes.
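The forward traversal can be sketched as follows. The sketch relies on a restatement that is assumed here rather than taken from the paper: an AND node outside GK has a cut inside GK exactly when each of its fanins is either in GK or already in GKL. The AIG encoding, the function name `kl_cut` and the small example graph are hypothetical and unrelated to Fig. 4.

```python
def kl_cut(ands, pos, gk):
    """ands: AND node -> (fanin, fanin), listed with fanins before fanouts.
    pos: set of primary outputs. gk: nodes of the chosen K-cut.
    Returns (nodes of GKL, output nodes GL)."""
    gk = set(gk)
    inside = set()
    for n, fanins in ands.items():
        # n joins GKL when every fanin is a cut input or already inside.
        if n not in gk and all(x in gk or x in inside for x in fanins):
            inside.add(n)

    fanouts = {}
    for n, fanins in ands.items():
        for x in fanins:
            fanouts.setdefault(x, set()).add(n)

    # An inside node is an output if it is a PO or feeds a node outside GKL.
    outputs = {n for n in inside
               if n in pos or any(m not in inside for m in fanouts.get(n, ()))}
    return inside, outputs

aig = {"n1": ("a", "b"), "n2": ("b", "c"), "n3": ("n1", "n2"), "n4": ("n2", "d")}
g_kl, g_l = kl_cut(aig, pos={"n3", "n4"}, gk={"a", "b", "c"})
print(sorted(g_kl), sorted(g_l))
```

Here n4 stays outside because it depends on d, so n2 becomes an output of the KL-cut alongside the primary output n3.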
Boolean division
Given the Boolean functions f and d, if it is possible to express f as f = q · d + r, where · and + represent the Boolean AND and OR operators respectively, then f can be divided by d, with q and r being the quotient and remainder of the division, respectively. This division operation can be performed by algebraic or Boolean methods.
A common approach to perform Boolean division is to use two-level minimizers that accept don't care information [6]. A new variable x is added and the division is performed by adding the satisfiability don't care (SDC) expression x ⊕ d to the DC-set of f, where ⊕ represents the Boolean exclusive-OR operator, followed by a two-level minimization.
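The role of the SDC can be illustrated without a minimizer: any function of the new variable x that agrees with f wherever x equals d is a valid result of the division. The sketch below checks this for a hypothetical example, f = a·b·c + a·b·e divided by d = a·b, with the candidate result x·(c + e); the two-level minimization that would find this candidate is not implemented here.

```python
from itertools import product

def f(a, b, c, e):
    return (a and b and c) or (a and b and e)

def divisor(a, b):
    return a and b

def candidate(x, c, e):
    """Hypothetical division result: quotient (c OR e), remainder 0."""
    return x and (c or e)

def agrees_on_care_set():
    for a, b, c, e, x in product((False, True), repeat=5):
        if x != divisor(a, b):
            continue  # point lies in the SDC x XOR d, so it is a don't care
        if f(a, b, c, e) != candidate(x, c, e):
            return False
    return True

print(agrees_on_care_set())
```

The candidate uses 3 literals instead of the 6 in the original sum of products, which is the kind of saving the minimizer is expected to expose.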
AIG TRANSFORMATIONS RESULTS
This section presents the AIG size reduction achieved by AIG transformations for a set of ITC99 and MCNC benchmarks [1]. Each benchmark is read and transformed into an AIG through algebraic factorization. Then structural hashing is performed, obtaining the number of AIG nodes (N) and levels (LV) shown in column "Initial" of Table 1.
     1: procedure booleanDecompositionAIG(AIG, cutParams)
        Input: An AIG network, and parameters to enumerate KL-cuts
     2:   for each node N in AIG in topological order do
     3:     for each kcut C of node N from kcut enumeration do
     4:       obtain the klcut from C in AIG
     5:       if the klcut is accepted based on cutParams then
     6:
     7:
     8:
     9:         if size of divisors < size of klcut then
    10:           new klcut = AIG network of divisors
    11:           replace klcut by new klcut in AIG
    12:           restart kcut enumeration
Figure 5: AIG optimization using Boolean decomposition.
In order to obtain a highly optimized AIG, the dc2 command is executed iteratively until no changes are observed. This reduces the number of nodes for the majority of cases, as seen in column "After dc2 ". The number of levels is usually reduced, but not for all cases. Geometric mean I and the ratio 1 refer to the complete set of benchmarks.
An alternative experiment is performed by first collapsing the netlist, just after reading the input file. The collapsing operation cannot finish for all benchmarks due to its complexity. In the cases it can finish, the AIGs are obtained through algebraic factorization and structural hashing. Then, the dc2 command is run iteratively until no changes are observed, generating the results shown in column "After collapse + dc2 ". Geometric mean II and the corresponding ratios 2 and 3 refer to the benchmarks in which collapsing could finish.
The final number of nodes varies from an 89% reduction (clma) to a 792% increase (b05). Collapsing the netlist may result in a larger AIG (see Sect. 2), and the AIG transformations may not obtain the same results as before collapsing, as these modifications are biased by the AIG structure. In the b05 benchmark, the initial description has very good logic sharing between outputs, which is lost after collapsing. The shared logic is not recovered due to the limitations of local optimization, as some of the benchmark outputs depend on a large set of inputs. For other cases, collapsing enables significant AIG reduction by removing redundant logic.
AIG OPTIMIZATION APPROACH
This section introduces a new AIG optimization approach, based on Boolean decomposition with two-literal divisors [15]. Boolean methods are known to be inefficient and not scalable, but also to obtain better results than algebraic methods. In this work, the Boolean decomposition method [15] is applied to multi-output functions and runtime is improved without losing quality of results. Still, the algorithm is not scalable for large circuits, and is therefore applied via local optimization.
AIG local optimization using KL-cuts
Pseudo-code for the AIG optimization strategy is shown in Fig. 5. The procedure booleanDecompositionAIG receives the AIG and the parameters to enumerate the KL-cuts, which define, for example, the number of nodes and inputs.
From the outputs to the inputs, each node is visited (line 2), and for each node all K-cuts enumerated based on the parameters are tried (line 3). In line 4, the KL-cut is obtained from the K-cut, and if accepted based on the parameters (line 5), the Boolean decomposition is performed on top of the KL-cut Boolean functions (line 8).
If the Boolean decomposition result is smaller than the number of nodes of the KL-cut, it replaces the logic in the AIG (line 11). Note that boolDecompose returns the set of divisors used to decompose the KL-cut functions, which can be easily translated to an AIG network. Also, since the AIG is changed when there is a KL-cut replacement, the previous K-cut enumeration has to be restarted (line 12).
    22:     Flits, Rlits = number of literals in f factored form
    23:     if d is accepted based on f vars polarities then
    24:       /* R(f) is the result of the division of f by d */
    25:
    26:       Rlits = literals in R(f) factored form
    27:     if Rlits > Flits then
    28:       /* If R(f) has more literals, or division was not
    29:          performed, select f as part of the solution */
    30:       R(f) = f; divLiterals += Flits
    31:     else
    32:       divLiterals += Rlits
    33:   /* If division reduced literals, update best result */
    34:   if divLiterals < numLiterals then
    35:     for each non-trivial f ∈ F do
    36:
    37:
    38:
    39: /* Step (IV) - Set the functions for next recursive call */
    40: for each non-trivial f ∈ F do
    41:
    42:
    43: /* Return the best divisor and make a new recursive call */
    44: return {bestDivisor} ∪ boolDecompose(newF, newDC)
Figure 6: Boolean decomposition procedure.
Boolean decomposition
The algorithm for boolDecompose is presented in Fig. 6 , which recursively performs the Boolean decomposition on a set of functions F . The algorithm is divided into four steps: (I) detection of trivial cases, (II) generation of candidate Boolean divisors and definition of the cost function, (III) selection of the best divisor via Boolean division, and (IV) preparation of the next recursive call.
At line 3, detection of trivial cases (I) is performed, identifying when all functions in F are decomposed. The algorithm is executed recursively until this condition is satisfied.
Step (II) starts by obtaining the algebraic factored form for each function in F (line 9). The cost function to be minimized is defined as the sum of literals of all functions in factored form (numLiterals, line 10). Then, the two-literal leaves of the factored form trees are generated as candidate divisors for each function in F (see Fig. 7 ). Boolean division is performed for all functions in F using each divisor in D, in order to calculate the cost function for all divisors. The best divisor (III) is the one that achieves the largest reduction in number of literals.
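The cost function and the candidate divisor set of Step (II) can be sketched on factored forms stored as binary trees. The encoding below (strings for literals, tuples for AND/OR nodes), the helper names and the two example functions are assumptions for illustration; the paper's factored forms come from algebraic factoring, which is not reproduced here.

```python
def count_literals(tree):
    """Number of literal leaves in a factored form tree."""
    if isinstance(tree, str):
        return 1
    _, left, right = tree
    return count_literals(left) + count_literals(right)

def two_literal_leaves(tree):
    """Operator nodes whose two children are both literals."""
    if isinstance(tree, str):
        return set()
    op, left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return {(op, *sorted((left, right)))}
    return two_literal_leaves(left) | two_literal_leaves(right)

# f1 = (a*b + c) * e'   and   f2 = (a*b) * (c + e')
f1 = ("and", ("or", ("and", "a", "b"), "c"), "!e")
f2 = ("and", ("and", "a", "b"), ("or", "c", "!e"))
F = [f1, f2]

num_literals = sum(count_literals(f) for f in F)          # cost to minimize
candidates = set().union(*(two_literal_leaves(f) for f in F))
print(num_literals, sorted(candidates))
```

In this example the divisor a·b appears as a two-literal leaf in both functions, so it is generated once and then evaluated against every function in F.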
As seen in Sect. 3.4, Boolean division is performed by adding the SDC of a divisor to the DC-set of a function and running two-level minimization. As the DC-set may contain variables not relevant to the division, a DC-set projection is performed to the support of the function f and divisor d (line 20). The projected DC-set is accumulated with the SDC generated by the evaluated divisor (line 21), generating the DC-set used in the two-level minimization (DC div ).
The division may be filtered out due to the polarities of the variables (see Sect. 5.3). If the division is filtered out or if the number of literals in the division result R(f) is greater than in f (line 27), then f is used as part of the current solution (line 30). If the number of literals is reduced after a division (line 34), the best solution is updated: the division result (line 35), the divisor and the number of literals (line 37).
The next iteration is prepared at step (IV). The ON-set (line 41) and the DC-set (line 42) obtained by the best divisor are used in the next recursive call (line 44).
Note that the two-level minimization could be performed on the multi-output function as a whole. However, running single-output two-level minimizations is preferred for three reasons: it is more efficient, divisions can be filtered based on the polarities of the variables, and divisions that increase the number of literals can be discarded for each output individually.
Filters to reduce runtime
Select divisors from factored forms. In comparison to [15] , only pairs of literals obtained from the leaf nodes of the factored form trees are selected as potential divisors, instead of trying all possible pairs of variables and polarities. The divisors are selected from all output functions of the KL-cut. For the benchmarks analyzed with our method, the quality of results was not affected by applying this filter, while the optimization runtime was significantly improved.
To illustrate the divisor selection, factored form trees obtained from b06 functions are depicted in Fig. 7. The two-literal leaves highlighted in Fig. 7 are the divisors selected. Notice that only one polarity is investigated, e.g., if the divisor c + ē is chosen, its negated version c̄ · e is disregarded.
Use variable polarity information. This filter is used to avoid exploring divisions with unpromising polarities between the divisor and the divided function. Table 2 describes the divisors that are accepted based on their support and the polarities of the variables in the divided function. A total of 849 Boolean divisions are performed during the Boolean decomposition of b06 when using this filter, versus 943 without it (and 13829 divisions would be done without any filter).
The polarity of the variables can be obtained using the concept of unateness, which is defined for completely specified functions.
EXPERIMENTAL RESULTS
Table 3 shows the AIG metrics (N: nodes, LV: levels) before and after Boolean decomposition. Column "Initial" reports the metrics of the AIGs after input and structural hashing, and column "ABC smallest" reports the AIGs with the fewest nodes from Table 1.
The AIG optimization via Boolean decomposition is applied on top of the AIGs of column "ABC smallest". KL-cuts with K=8 and unbounded L are enumerated in order to obtain smaller parts of the AIG with complete local context. Also, the number of nodes of the KL-cuts is restricted to 30, which gives a very limited scope of optimization. Boolean decomposition is performed on the KL-cut output functions, only replacing the KL-cut logic if the number of nodes is reduced. The experiments were run on an Intel Core i7 processor with 4GB of RAM. All AIGs passed formal verification using the ABC command cec.
The column "Boolean Decomp." reports the results obtained after performing two iterations of the Boolean decomposition method. By using Boolean decomposition, it was possible to reduce the number of AIG nodes further by 7.76% on average, with notable results such as 25.81% (b06), 16.95% (spla), 15.84% (bigkey) and 14.8% (clma).
Our approach identifies better logic sharing, which tends to increase the number of levels, since depth is not controlled by our method. Still, the increase is at most 1 level for 15 of the 25 benchmarks evaluated, and the average number of levels remains smaller than in the Initial results.
Table 4 shows technology mapping performed for FPGAs and standard cells using the AIGs from Table 3. Area reduction was observed simply by using the smaller AIGs as input. Mapping to LUT4s was performed with the ABC command "if -K 4", obtaining 5.5% area reduction on average. Mapping to standard cells was performed with the ABC command "map" using the library "GSCLib 3.0.lib" of [1], obtaining 6.2% area reduction on average.
CONCLUSIONS
This paper introduced an approach to explore the potential use of Boolean decomposition in the optimization of AIGs. The experiments show promising results, with an average reduction of 7.76% in AIG nodes. Scalability is one of the aspects that requires more investigation. We envision a synthesis system in which smart oracles could guide the search for divisors based on simple correlation metrics between functions and divisors.
As future work, there are some directions that could be explored. For example, different types of cuts and more combinations of divisors could be studied. Using other models of flexibility (Boolean relations) could also be considered. Delay is another important aspect that is not considered in this work, but could be incorporated, controlling the number of levels and reducing the resulting circuit delay.
ACKNOWLEDGMENTS
We thank Mayler Martins for helpful discussions. The present work was performed with the support of CNPq, Conselho Nacional de Desenvolvimento Científico e Tecnológico -Brasil. This work has been partially supported by funds from the Spanish Ministry for Economy and Competitiveness and the European Union (FEDER funds) under grant TIN2013-46181-C2-1-R, and the Generalitat de Catalunya (2014 SGR 1034).
