Abstract-A gate clustering technique is proposed to decrease the transistor count of a circuit. It enables an optimization aimed at reducing area and power, especially leakage power. In a circuit netlist, sets of connected cells with fanout one are clustered and replaced by a single cell with the same logic function but fewer transistors. To validate the technique, we applied it to the ITC99 benchmark circuits synthesized to a 180 nm vendor cell library. The results were compared with the original netlist regarding area, dynamic power, and leakage power. The number of transistors is reduced by 8% on average, and the number of connections by 27%. A significant power reduction is also achieved as a consequence of the reduced transistor count.
I. INTRODUCTION
A large share of chip designs, mainly ASICs, use a traditional standard cell design flow. It is a well-established flow based on the use of a cell library. In modern technologies, mainly below 65 nm, leakage power is becoming as important as dynamic power. Even considering the different methodologies available to reduce it, leakage power also depends on the number of transistors. Therefore, optimizing leakage power requires new methods that reduce the number of transistors needed to implement a specific function [1]. We are working on two steps of the physical design flow: one is the optimization of the transistor count at the logic level, and the other is the automatic layout generation of any transistor network [2]. This work addresses the first step, the reduction of the number of transistors needed to implement a function, assuming that any resulting transistor netlist can be laid out by an automatic layout generator for transistor networks, such as ASTRAN [3].
In the standard cell design flow, a circuit is mapped onto a set of logic gates from a cell library to avoid the complexity of a full custom layout design. In the technology mapping step, the logic synthesis tool tries to find the best mapping to the set of logic functions available in the cell library [4]. This is in essence an anti-optimization step: traditional cell libraries usually offer no more than a hundred different logic functions, whereas a limit of 4 PMOS and 4 NMOS serial transistors allows 3503 different functions, giving the optimization process many more cell options [5].
An alternative to the standard cell design methodology is library-free technology mapping, which uses a large virtual library instead of a cell library [6] [7] [8]. The logic synthesis optimization can then assume that any optimized logic netlist can be mapped into silicon using networks of transistors. This is possible if a tool is available to generate the layout of any transistor network. A standard cell design methodology places and routes cells from a library, while a library-free approach places and routes transistors.
There are several advantages in implementing logic functions with transistor networks instead of standard cells. As demonstrated in [9], it is possible to obtain a significant reduction of leakage power and delay, mainly through the reduction of the number of transistors. Also, as shown in [10], using transistor networks instead of standard cells gives an extra degree of freedom to explore different network topologies in order to optimize leakage power. Another important advantage is the reduction of the number of connections and vias, which helps to reduce delay and area and also lowers the routing complexity, since there are fewer connections to be routed.
A mixed solution between the standard cell and library-free methodologies is also a way to exploit some advantages of both techniques, as shown in [11]. The netlist resulting from the standard cell methodology is refined by replacing connected combinational gates of unitary fanout with a single logically equivalent gate, focusing on the reduction of the transistor count of the whole circuit. In this paper, we go further into this proposal and evaluate the technique by applying it to circuits from the ITC99 benchmark. We synthesize them to a commercial 180 nm cell library, and also with the clustered transistor networks laid out by the ASTRAN cell layout generator, which allows both approaches to be compared [2]. This paper is organized as follows. General aspects and illustrative examples of the technique are presented in Section II. Section III explains the design methodology that was used and the related algorithm. The results and conclusions are presented in Sections IV and V, respectively.
II. GENERAL CONSIDERATIONS
In circuits using CMOS technologies, the transistors are normally organized as transistor networks using complementary PMOS and NMOS devices: a pull-up set, composed of PMOS transistors, and a pull-down set, composed of NMOS transistors. It should be observed that it is also possible to explore non-complementary CMOS circuits.
Any combinational circuit can be implemented using a universal set of gates. Usually, only a relatively small number of commonly used gates are included in a typical cell library, on the order of a hundred different logic functions. Most cell libraries also include flip-flop (FF) cells and some arithmetic ones. The cell library used in this work has 828 cells, including combinational and sequential cells. The set of cells is quite large because there are several versions of each function with different gate sizings. However, this represents only a small portion of all possible logic functions, as shown in Table I [5]. We call SCCG (Static CMOS Complex Gate) a gate that implements a logic function with more than two levels of logic. AOI and OAI gates are common examples. As illustrated in Fig. 1, using these gates can lead to more compact circuits regarding transistor count, which can result in better area usage and less power dissipation. It is important to observe in the example that the reduction from 14 to 8 transistors to implement the same function is not the only benefit. There is also an elimination of 3 connections and 6 vias, which helps the routing step because fewer connections have to be made. However, implementing a function with an SCCG is not an option in many cases, because traditional cell libraries usually implement just a small set of such functions.

A more detailed example is shown in Fig. 2, which contains a portion of the B02 circuit from ITC99. On the left side, C is a primary input, and the colored gates satisfy the described criteria. They will be clustered and replaced by a new customized SCCG implementing the same logic function. Next, the figure shows the 18-transistor network that results from implementing the function F as it appears in the original netlist, and then a 12-transistor network that implements the same function performed by the set of the 3 original gates. On the right side, the logical view of the circuit after the gate replacement is shown.
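The kind of saving discussed for Fig. 1 can be reproduced with a simple count. The sketch below is only illustrative: the cell names, the per-cell transistor counts (the usual static CMOS values), and the example cluster are assumptions and are not taken from the figures or from the vendor library. It assumes that a single-stage complementary series-parallel gate uses one NMOS and one PMOS transistor per literal.

```python
# Typical static CMOS transistor counts (assumed values, not from the paper's library).
CELL_TRANSISTORS = {
    "INV": 2,
    "NAND2": 4,
    "NOR2": 4,
    "AND2": 6,  # NAND2 followed by an inverter
    "OR2": 6,   # NOR2 followed by an inverter
}


def netlist_transistors(cells):
    """Total transistor count of a list of discrete cells."""
    return sum(CELL_TRANSISTORS[cell] for cell in cells)


def csp_transistors(num_literals):
    """Single-stage complementary gate: one NMOS and one PMOS per literal."""
    return 2 * num_literals


# Hypothetical cluster: F = !(a*b + c*d) built as AND2(NAND2(a, b), NAND2(c, d)).
discrete_cells = ["NAND2", "NAND2", "AND2"]
before = netlist_transistors(discrete_cells)
after = csp_transistors(4)  # literals a, b, c, d
reduction = 100.0 * (before - after) / before

print(f"{before} -> {after} transistors ({reduction:.2f}% fewer)")
```

In this hypothetical case the two NAND2 outputs and the internal node of the AND2 cell become internal to the complex gate, which is how a cluster also removes connections from the netlist.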
The transistor network of the new SCCG in Fig. 2 is one of many possible structures that implement the logic function. Some works focus on exploring the possible transistor network structures that implement a given logic function while targeting different parameters, or on minimizing the number of transistors needed [12]. In this work, we focus on finding the functions that will originate the new SCCGs rather than on finding the best transistor network. For all the SCCGs generated hereafter, we choose the complementary series-parallel (CSP) transistor network structure.
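For a CSP network, the number of serial transistors in each plane follows directly from the factored form of the function. The sketch below assumes a gate of the form F = NOT(expr), where expr is a nested tuple of "and"/"or" operators over literal names; this representation and the function names are hypothetical and chosen only for illustration. In the pull-down plane an AND puts branches in series and an OR puts them in parallel, and the pull-up plane is the dual.

```python
def series_depth(expr):
    """Return (nmos_series, pmos_series) for the CSP gate F = NOT(expr)."""
    if isinstance(expr, str):  # a literal costs one transistor in each plane
        return (1, 1)
    op, *args = expr
    depths = [series_depth(arg) for arg in args]
    nmos = [d[0] for d in depths]
    pmos = [d[1] for d in depths]
    if op == "and":  # series in the pull-down, parallel in the pull-up
        return (sum(nmos), max(pmos))
    if op == "or":   # parallel in the pull-down, series in the pull-up
        return (max(nmos), sum(pmos))
    raise ValueError(f"unknown operator: {op}")


def literal_count(expr):
    """Number of literals, i.e. transistors per plane in a CSP network."""
    if isinstance(expr, str):
        return 1
    return sum(literal_count(arg) for arg in expr[1:])


# F = !(a*b + c*d): an AOI22-style function.
aoi22 = ("or", ("and", "a", "b"), ("and", "c", "d"))
# F = !(a*b*c + d): three serial NMOS, two serial PMOS.
aoi31 = ("or", ("and", "a", "b", "c"), "d")

print(series_depth(aoi22), literal_count(aoi22))
print(series_depth(aoi31), literal_count(aoi31))
```

A check of this kind is one way to decide whether a candidate function respects a limit such as 4 serial transistors per plane.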
Notice that some variables are inverted in the final logic function. This does not necessarily imply adding an extra inverter to the circuit, since the inverted signal may already be available in the netlist, as is the case for signal A in the example. Additionally, the logic of the circuit that drives such an inverted variable could also be back-propagated until it reaches a temporal barrier (typically an FF output) or a primary input.
III. DESIGN METHODOLOGY
In order to explain the adopted methodology, consider the digraph presented in Fig. 3. Vertices are gates and arrows are wires. Arrows go from driver to sink, and a blue vertex is a fanout-one gate. The method proposed in [11] works by replacing chains of fanout-one gates, taking chains as large as possible. This leads to gates with many inputs and many serial transistors, which may not be desirable.
In our approach, we add two constraints to the clustering method: the number of serial transistors (in the pull-up and in the pull-down networks) and the number of inputs. Even with these limits the method remains greedy, since the decision to include a cell in the chain being clustered is made locally and is never backtracked. However, the limits are enough to produce more realistic new cells as a result of the proposed clustering process.
As an illustration, Fig. 4 shows the graph that results from applying the clustering technique with a limit of 4 serial transistors and 6 inputs. Yellow vertices are clusters. Three clusters are added, all of them with 6 inputs; two have 4 serial transistors and the other has 3. They replace 14 cells of the original circuit, using 15.38% fewer transistors.
Notice that gate 14 is not clustered into SCCG 32 (Fig. 4), although it is part of the chain of connected fanout-one gates 9, 6, 14, 18, 27 (Fig. 3). The same happens to gate 20 in the chain 4, 8, 11, 12, 19, 13, 20, 25 that originated SCCG 31. This is due to the limits of 4 serial transistors and 6 inputs set for this example.
The clustering algorithm is presented in Algorithm 1. Starting with a fanout-one cell as seed, and while there are connected cells of fanout one, each of these cells is added to the cluster if adding it does not exceed the established limits.
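The greedy growth described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of Algorithm 1: the netlist is assumed to be a dictionary mapping each gate name to the list of nets it reads, with each gate output net named after the gate; all function names are hypothetical; only the input limit is checked directly, and the serial transistor limit is left to an optional `extra_check` callback because it depends on the logic function of each cell.

```python
from collections import defaultdict


def build_sinks(netlist):
    """Map each net to the list of gates that read it."""
    sinks = defaultdict(list)
    for gate, inputs in netlist.items():
        for net in inputs:
            sinks[net].append(gate)
    return sinks


def is_fanout_one(gate, sinks, primary_outputs):
    """True if exactly one gate input reads this gate's output."""
    return gate not in primary_outputs and len(sinks.get(gate, [])) == 1


def cluster_inputs(cluster, netlist):
    """Distinct nets that enter the cluster from outside."""
    inputs = []
    for gate in cluster:
        for net in netlist[gate]:
            if net not in cluster and net not in inputs:
                inputs.append(net)
    return inputs


def grow_cluster(seed, netlist, sinks, primary_outputs, visited,
                 max_inputs, extra_check=None):
    """Greedily absorb connected fanout-one gates while the limits hold."""
    cluster = [seed]
    visited.add(seed)
    frontier = [seed]
    while frontier:
        gate = frontier.pop()
        drivers = [net for net in netlist[gate] if net in netlist]
        for g in drivers + sinks.get(gate, []):
            if g in visited or not is_fanout_one(g, sinks, primary_outputs):
                continue
            candidate = cluster + [g]
            if len(cluster_inputs(candidate, netlist)) > max_inputs:
                continue  # local decision, never undone (greedy)
            if extra_check is not None and not extra_check(candidate):
                continue  # e.g. a serial transistor limit (assumed callback)
            cluster = candidate
            visited.add(g)
            frontier.append(g)
    return cluster


def cluster_netlist(netlist, primary_outputs, max_inputs=6, extra_check=None):
    """Return every cluster that contains more than one gate."""
    sinks = build_sinks(netlist)
    visited = set()
    clusters = []
    for seed in netlist:
        if seed in visited or not is_fanout_one(seed, sinks, primary_outputs):
            continue
        cluster = grow_cluster(seed, netlist, sinks, primary_outputs,
                               visited, max_inputs, extra_check)
        if len(cluster) > 1:
            clusters.append(cluster)
    return clusters


# Hypothetical netlist: g1, g2 and g3 have fanout one; g4 drives two gates.
example = {
    "g1": ["a", "b"],
    "g2": ["c", "d"],
    "g3": ["g1", "g2"],
    "g4": ["g3", "e"],
    "g5": ["g4", "f"],
    "g6": ["g4", "h"],
}
print(cluster_netlist(example, {"g5", "g6"}))
print(cluster_netlist(example, {"g5", "g6"}, max_inputs=3))
```

Because every absorbed gate has fanout one, each cluster keeps a single output, so it can be replaced by one complex gate. Lowering `max_inputs` in the example leaves `g2` out of the cluster, which mirrors how the limits exclude some gates of a chain.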
When a clustered gate is identified, it is necessary to design its layout. This task can be done by hand or by a layout generator tool such as ASTRAN [2], as we did. It takes as input a SPICE netlist and a file with the technology parameters, and its output is the layout of the cell. The input SPICE netlist has to be sized, and we use logical effort for this. The next step is the extraction and characterization of the new SCCG, which was done using Virtuoso and Encounter Library Characterizer (Cadence). Finally, we replace the instances of clustered gates in the netlist with the new SCCGs.
IV. GATE CLUSTERING RESULTS
Algorithm 1 (fragment)
...
cluster.add(seed)
seed.visited(true)
for all Gate g : getConnectedGates(seed) do
...
return cluster
end function

In order to evaluate the proposed gate clustering methodology, we applied it to circuits from the ITC'99 benchmark. We used Cadence Encounter RTL Compiler targeting a 180 nm vendor library. All circuits were synthesized to a flat design with a clock of 400 MHz. The first results target the transistor count reduction, the main concern of this paper. These results are summarized in Table II. We applied our technique to a netlist composed only of basic gates of a vendor library, without multiplexers and arithmetic cells.
Gate clustering reduced the number of instances by 29.68% on average, and the number of transistors by 8.46% on average. The most significant reduction in the number of transistors was 15.38%, in the case of b02.
Besides that, as also shown in Table II, our technique leads to an average reduction in the number of wires that is even more significant than the reduction in the number of transistors. We achieve an average reduction of 27.67% in the number of wires, with a peak of 36.36% in b09. Reducing the number of wires reduces the routing complexity, and this can have a large impact on overall circuit performance, since wire propagation delay is a major issue in modern technologies.
A main impact of reducing the number of transistors is on power, as shown in Table III. Our approach reduces the dynamic power consumption by 13.17% on average, and the leakage power consumption by 9.20% on average. Note that for the b07 circuit a small increase in dynamic power is observed, together with a significant 11.47% decrease in leakage power. This can be a desirable result, considering that for technologies below 180 nm leakage power is increasingly relevant. Since the 180 nm technology used in the examples is far from recent technologies such as 14 nm, the absolute leakage power reduction is expected to be much larger in sub-20 nm technologies. Even so, the percentage power reduction obtained with a 180 nm technology already shows the advantages of our approach, which are expected to be greater when it is applied to sub-20 nm technologies.
V. CONCLUSIONS
A technique to reduce the number of transistors of a combinational circuit was presented. Its heuristic is based on replacing chains of fanout-one gates with a single logically equivalent complex gate. The proposed technique proved to be effective, achieving an average reduction of 8.46% in the number of transistors when compared with the original netlist. Our technique is also able to reduce the number of wires of the original netlist by up to 36.36%, which can improve routability and delay in modern technologies. We also show that reducing the number of transistors reduces power, with average reductions of about 13% and 9% for dynamic and leakage power, respectively.
