This paper proposes a library-free technology mapping algorithm to reduce delay in combinational circuits. The algorithm reduces the overall number of series transistors along the longest path, subject to a maximum admitted chain length for each cell network. The number of series transistors is computed in a Boolean way, reducing the structural bias. The mapping is performed on a Directed Acyclic Graph (DAG) description of the circuit. Preliminary delay results were obtained through SPICE simulations. When compared to SIS technology mapping, the proposed method shows significant delay reductions, considering circuits mapped with different libraries.
INTRODUCTION
Technology Mapping (TM) is the step of logic synthesis that chooses the cells that will be used to implement a circuit in a given technology. Normally, the cells are chosen from a precharacterized library [1] [2] [3] [4]. Early methods for technology mapping used trees as the initial description of the circuit to be mapped. More recent methods are based on Directed Acyclic Graph (DAG) representations, which allow logic to be duplicated to some extent to increase speed. Another important contribution to technology mapping was Boolean matching [5], where a portion of the circuit is matched against a library cell by comparing the Boolean functions of the candidates instead of their structures. Structural comparison alone is not able to find all matches.
In the early phase of technology mapping, it was considered that the use of a cell generator [6] would enable the use of larger virtual (built on demand) cell libraries. Unfortunately, such approaches were not widely adopted at a commercial level, even though other references suggest that increasing the number of cells in a library could lead to significant improvements in the quality of the final design [7] [8] [9] [10]. A recent approach presented in [11] suggests that adding some custom cells to a library can improve the speed of the final circuit. Recently, some methods for generating efficient cell networks were proposed [12] [13] [14] [15], including a method [15] to compute the minimum number of transistors in series needed to implement an arbitrary Boolean function. These improvements were presented only at the cell level, and lack an efficient method for mapping a larger circuit.
The contribution of this paper is to combine the method for Boolean computation of the number of series transistors presented in [15] with a state-of-the-art technology mapping algorithm inspired by the approach presented in [4]. Significant gains in delay are obtained from both aspects combined in the proposed mapping tool. The algorithm is library-free, as it chooses the transistor configuration of the cells, which then have to be created by a cell generation tool in a subsequent step. This paper is organized as follows. Section 2 presents the background. It describes why the method used to compute series transistors is Boolean, and presents the rationale that relates series transistors to circuit delay. Section 3 presents the proposed algorithm. Results are presented in Section 4 and conclusions in Section 5.
BACKGROUND
The method used in this paper to compute the number of series transistors in a cell network is a Boolean method. It computes the lower bounds [15] for the number of series transistors in the longest pull-up and pull-down chains of a cell implementing a given logic function. Boolean methods are able to overcome the structural bias [18] of the circuit being mapped, because they do not depend on the DAG structure, but only on the function being mapped. Another important point is that the associative methods used in [6, 7, 8, 9, 10] to compute series transistor constraints are monotonically increasing with the association, meaning that the association of two functions will always have more transistors in series. The Boolean method is non-monotonic, meaning that the association of two functions can reduce the number of transistors in series. The differences between the Boolean method (lower bound) used in this paper and the well-known complementary series-parallel (CSP) approach are highlighted in Table 1.
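The monotonic behavior of an associative count can be illustrated with a minimal sketch. It assumes a CSP gate whose pull-down network follows a series-parallel expression directly (AND in series, OR in parallel) and whose pull-up network is the dual. The classes `Lit`, `And`, `Or` and the functions `series_chain` and `dual_chain` are hypothetical names introduced only for this illustration; they are not taken from [6]-[10].

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Lit:
    name: str


@dataclass(frozen=True)
class And:
    args: Tuple["Expr", ...]


@dataclass(frozen=True)
class Or:
    args: Tuple["Expr", ...]


Expr = Union[Lit, And, Or]


def series_chain(expr):
    """Longest series chain of a switch network built directly from expr:
    AND places sub-networks in series (lengths add),
    OR places them in parallel (the longest branch dominates)."""
    if isinstance(expr, Lit):
        return 1
    lengths = [series_chain(a) for a in expr.args]
    return sum(lengths) if isinstance(expr, And) else max(lengths)


def dual_chain(expr):
    """Longest series chain of the dual network (AND and OR swap roles)."""
    if isinstance(expr, Lit):
        return 1
    lengths = [dual_chain(a) for a in expr.args]
    return max(lengths) if isinstance(expr, And) else sum(lengths)


# g = a.b + c, then h = g.d, then k = h + e
g = Or((And((Lit("a"), Lit("b"))), Lit("c")))
h = And((g, Lit("d")))
k = Or((h, Lit("e")))
for name, e in (("g", g), ("h", h), ("k", k)):
    print(name, series_chain(e), dual_chain(e))
```

In this structural count each association can only keep or increase the chain lengths, since the result is a sum or a maximum of the lengths of the operands. A purely functional count has no such restriction, because the composed function may simplify.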
For computing the lower bound we use a modified version of ESPRESSO-SIGNATURE [19] that does not generate the final SOP, only computes the length of minimum transistor chains. Our algorithm also uses a method to produce cell networks with minimum length transistor chains for the selected functions. Fig. 1 shows a comparison between a CSP and a lower bound implementation for the same function. Notice that the number of series transistors in the pull-up is reduced in Fig. 1.b, leading to a faster implementation. More details can be found in [15].
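To make the idea of a function-based bound concrete, the following brute-force sketch computes one simple bound directly from a truth table. It is not the ESPRESSO-SIGNATURE based routine used in this paper and is not claimed to reproduce the procedure of [15]. It relies on the following assumption: every input assignment for which a switch network conducts must be covered by some conducting path, and a path with k series switches is a product of k literals that implies the function. The names `onset_of` and `chain_lower_bound` are hypothetical.

```python
from itertools import product


def onset_of(fn, n):
    """Set of input assignments (tuples of 0/1) for which fn is true."""
    return {m for m in product((0, 1), repeat=n) if fn(*m)}


def chain_lower_bound(onset, n):
    """Bound on the longest series chain of any switch network that conducts
    exactly on `onset`: for each minterm take the smallest implicant that
    covers it, then take the worst case over all minterms.
    Exhaustive over 3**n cubes, so only suitable for small n."""
    onset = set(onset)
    if not onset:
        return 0
    best = {m: n for m in onset}  # a full minterm is always an implicant
    for cube in product((0, 1, None), repeat=n):  # None marks a free variable
        free = [i for i, v in enumerate(cube) if v is None]
        literals = n - len(free)
        covered = []
        for fill in product((0, 1), repeat=len(free)):
            m = list(cube)
            for i, v in zip(free, fill):
                m[i] = v
            covered.append(tuple(m))
        if all(m in onset for m in covered):  # the cube implies the function
            for m in covered:
                best[m] = min(best[m], literals)
    return max(best.values())


# Written structurally as a.b + a.!b.c, the longest product has 3 literals,
# but the function equals a.(b + c) and the functional bound is 2.
g = onset_of(lambda a, b, c: (a and b) or (a and not b and c), 3)
not_g = set(product((0, 1), repeat=3)) - g
print(chain_lower_bound(g, 3), chain_lower_bound(not_g, 3))
```

The example shows the effect of removing structural bias: the bound depends only on the on-set, so two different expressions of the same function always receive the same value.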
LLWF ALGORITHM
The Library-Less Wavefront (LLWF) technology mapping algorithm is outlined in Fig. 2. It uses the matching and covering routines presented in Fig. 3 and Fig. 4. The labeled vertical dashed lines represent the successive heads of the wavefront. In the algorithm proposed here, the matching generation window is given by the wave_width. Hence, the pattern matches are generated in the interval [head - wave_width : head], as described by the algorithm in Fig. 3. The covering algorithm relies on a function for computing the lower bound of series transistors. As this procedure needs to be fast, we developed a routine similar to ESPRESSO-SIGNATURE [19]. The covering algorithm is shown in Fig. 4, and it minimizes the accumulated sum of transistor chain lengths along the critical path. This initial cost function is a first-order approach, but our results show that it correlates well with circuit delay. A possible explanation is that reducing the number of series transistors decreases the logical effort [16] of the cells. In the future we intend to use a more precise delay model during covering. For the moment, our first-order approach produces results that demonstrate the contribution of the method.
Procedure LLWF_mapper(max_pu, max_pd, wave_width) {
    remove_all_inverters();
    levelize_circuit();
    highest_level = highest level of the circuit;
    head_level = 1;
    while (head_level ≤ highest_level) {
        head_nets = list of all nets on head_level;
        foreach net, n in head_nets {
            /* Generate all matches considering a set of constraints */
            generate_matches(n, max_pu, max_pd, head, wave_width);
            /* Select the best match for the net n */
            covering_algorithm(n);
        }
        increment head_level;
    }
    add_inverters();
}
Figure 2. Main Algorithm.
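The levelization step and the advance of the wavefront head in Fig. 2 can be sketched as follows. This is a minimal illustration, not the VIRMA-WF implementation: the dictionary-based netlist and the function names `levelize` and `wavefront_order` are assumptions made for the example. Primary inputs are taken to be at level 0, and the head starts at level 1 as in Fig. 2.

```python
def levelize(fanins):
    """fanins maps each driven net to the list of nets feeding its gate.
    Nets that never appear as keys are treated as primary inputs (level 0).
    Returns a dict net -> level (longest distance from the primary inputs)."""
    level = {}

    def visit(net):
        if net not in level:
            ins = fanins.get(net, [])
            level[net] = 0 if not ins else 1 + max(visit(i) for i in ins)
        return level[net]

    for net in fanins:
        visit(net)
    return level


def wavefront_order(fanins):
    """Yield (head_level, nets on that level) in the order the mapper visits them."""
    level = levelize(fanins)
    highest = max(level.values(), default=0)
    for head in range(1, highest + 1):
        yield head, sorted(n for n, l in level.items() if l == head)


# Small hypothetical circuit: n4 = f(n3, a), n3 = f(n1, n2), n1 = f(a, b), n2 = f(c, d)
circuit = {"n1": ["a", "b"], "n2": ["c", "d"], "n3": ["n1", "n2"], "n4": ["n3", "a"]}
for head, nets in wavefront_order(circuit):
    print(head, nets)
```

In the mapper, match generation and covering would be invoked for each net at the position of the `print` call.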
Procedure generate_matches(n, max_pu, max_pd, head, wave_width) {
    - In the DAG, generate all the pattern matches for the net n, such that the search for pattern matches is performed in the interval [head - wave_width : head]. At this point, other constraints can be used to limit the match generation;
    foreach pattern match, pat in the list of pattern matches {
        /* Compute lower bound for pull-up and pull-down planes */
        cost_pu = compute_lower_bound_pu(pat);
        cost_pd = compute_lower_bound_pd(pat);
        if (cost_pu <= max_pu && cost_pd <= max_pd) {
            - Store pat as a logic_cell;
            - Make logic_cell a driver of n in the DAG representation;
            - Connect all inputs of logic_cell to their corresponding nets;
        }
    }
}
Figure 3. Matching Algorithm.
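The window-limited search of Fig. 3 can be sketched as an enumeration of input sets (cuts) for the net n. The sketch rests on one interpretation of the interval [head - wave_width : head], stated here as an assumption: a gate may be absorbed into a match only if the level of its output net is above head - wave_width. The function name `window_matches` and the dictionary-based netlist are hypothetical, and the filtering by max_pu and max_pd is left out.

```python
def window_matches(net, fanins, level, head, wave_width):
    """Enumerate candidate matches for `net` as sets of input nets.
    fanins: dict net -> list of nets feeding its gate (primary inputs absent).
    level:  dict net -> level of the net.
    A leaf is expanded (its gate absorbed) only while it lies inside the window."""
    low = head - wave_width
    seen = set()
    matches = []
    stack = [frozenset([net])]
    while stack:
        leaves = stack.pop()
        for leaf in sorted(leaves):
            ins = fanins.get(leaf, [])
            if not ins or level[leaf] <= low:
                continue  # primary input, or gate outside the window
            new = (leaves - {leaf}) | frozenset(ins)
            if new not in seen:
                seen.add(new)
                matches.append(new)
                stack.append(new)
    return matches


# Hypothetical circuit: n4 = f(n3, a), n3 = f(n1, n2), n1 = f(a, b), n2 = f(c, d)
fanins = {"n1": ["a", "b"], "n2": ["c", "d"], "n3": ["n1", "n2"], "n4": ["n3", "a"]}
level = {"a": 0, "b": 0, "c": 0, "d": 0, "n1": 1, "n2": 1, "n3": 2, "n4": 3}
for width in (1, 2, 3):
    print(width, len(window_matches("n4", fanins, level, head=3, wave_width=width)))
```

Because leaves are expanded across any net inside the window, the enumeration crosses fanout points, and the number of candidates grows quickly with the wave_width. This is one reason additional limits on variables or literals are useful.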
Procedure covering_algorithm(n) {
    - Compute the sum of pull-up for all cells driving net n;
    - Compute the sum of pull-down for all cells driving net n;
    - Select the cell with the lowest sum of pull-up and pull-down;
    - Disconnect all driver cells on n, except the selected cell, and perform a cleanup operation on their exclusive inputs;
}
Figure 4. Covering Algorithm.
In the algorithm, two constraints are used to validate the patterns: max_pu and max_pd. They limit the maximum number of series transistors in the pull-up (max_pu) and pull-down (max_pd) chains for each match of a given circuit net. As an example of the application of the algorithm, assume the following constraint values: max_pu = 2, max_pd = 3 and wave_width = 3. Initially, the head starts at level 0 (primary input nets). It advances level-by-level and the match generation is done for all nets on the head level. After the match generation for a given net n, the covering algorithm is immediately invoked to choose the best match for n. These steps are repeated until the head reaches the highest level in the circuit.
In Fig. 5.d, when the head reaches level 1, the 2-input NAND gates are added as drivers of their respective nets in the circuit, creating multi-source nets. After the covering algorithm selects the best match for the last net on level 1, the head is moved to level 2. As the wave_width is equal to the highest level of the circuit, on levels 2 and 3 matches are generated all the way to the primary inputs. Finally, considering the inversion flags in the circuit representation, inverters are inserted where necessary. Fig. 5.b shows the mapped circuit under the initial constraints. If the wave_width is reduced to 2, the circuit in Fig. 5.c is obtained. This shows how the wave_width affects the quality of the mapped circuit.
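The selection step of the covering algorithm can be sketched as below. The sketch assumes one reading of the cost function: the accumulated cost of a candidate cell is its own pull-up plus pull-down chain length, added to the largest accumulated cost among its input nets, with primary inputs at cost 0. The tuple layout and the name `select_driver` are hypothetical.

```python
def select_driver(candidates, arrival):
    """candidates: list of (name, pu, pd, input_nets) cells driving the same net.
    arrival: dict net -> accumulated chain cost already chosen for that net
             (nets missing from the dict, e.g. primary inputs, cost 0).
    Returns the selected candidate and its accumulated cost."""
    def accumulated(cell):
        _, pu, pd, inputs = cell
        return pu + pd + max((arrival.get(i, 0) for i in inputs), default=0)

    best = min(candidates, key=accumulated)
    return best, accumulated(best)


# Nets n1 and n2 were already covered by 2-input NANDs (pu = 1, pd = 2, cost 3).
arrival = {"n1": 3, "n2": 3}
candidates = [
    ("nand2_on_n1_n2", 1, 2, ["n1", "n2"]),       # 3 + 3 = 6 along the path
    ("complex_on_pis", 2, 2, ["a", "b", "c", "d"]),  # 4 + 0 = 4 along the path
]
best, cost = select_driver(candidates, arrival)
print(best[0], cost)
```

In this example the larger cell is preferred even though its own chains are longer, because it replaces two levels of logic and therefore shortens the accumulated sum along the path.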
All cells in the final circuit are constrained to have a pull-up chain that is no longer than the pull-down chain. This is achieved by changing the polarity of cell inputs and outputs and exchanging the pull-up and pull-down networks. Besides the constraints max_pu and max_pd, another set of restrictions can be used to avoid an excessive number of matches. For instance, the number of variables and/or the number of literals can also limit pattern matches. The matches are not limited to fanout-free (tree) regions; i.e., the match generation searches across fanout points, provided that these nets are in the interval [head - wave_width : head]. However, the search can easily be limited to fanout-free regions by testing the fanout of each net in the search space. This technique can be used to save area.
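The polarity exchange can be illustrated with a short sketch. It relies on De Morgan duality: a cell with inverted inputs and an inverted output implements the same function when its pull-up and pull-down networks are exchanged. The `Cell` record and its flag names are hypothetical and only track chain lengths and inversion flags.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cell:
    pu: int                      # longest series chain in the pull-up network
    pd: int                      # longest series chain in the pull-down network
    inputs_inverted: bool = False
    output_inverted: bool = False


def normalize(cell):
    """Return an equivalent cell whose pull-up chain is not longer than its
    pull-down chain, exchanging the networks and flipping polarities if needed."""
    if cell.pu <= cell.pd:
        return cell
    return replace(
        cell,
        pu=cell.pd,
        pd=cell.pu,
        inputs_inverted=not cell.inputs_inverted,
        output_inverted=not cell.output_inverted,
    )


# A 2-input NOR (pu = 2, pd = 1) becomes a NAND-like network (pu = 1, pd = 2)
# with inverted inputs and output, since NOR(a, b) = NOT NAND(NOT a, NOT b).
print(normalize(Cell(pu=2, pd=1)))
```

The inversion flags produced this way would be resolved when inverters are reinserted at the end of the mapping.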
RESULTS
In this section we present results of the algorithm developed in this paper. It was implemented in a tool called VIRMA-WF, written in Java. All results were generated on a PC workstation running Windows XP with an AMD Athlon processor. A first study analyzes the effect of varying the wave_width. The effect on delay is illustrated in Fig. 6 using four benchmark circuits and varying the wave_width from 1 to 6. As with the original wavefront [4], for widths larger than 4 the delay is constant or shows only slight improvements. Since a width of 4 is more practical, this wavefront size was chosen for the comparison between the SIS and LLWF technology mapping algorithms.
The second experiment is a comparison between SIS technology mapping and our method. The circuits were first decomposed into inverters and 2-input NAND/NOR gates using SIS. Next, we performed technology mapping, using SIS and our method, for all benchmark circuits. Finally, using our cell generator, NCSP and CSP CMOS transistor networks were derived for the mapped circuits produced by our method and by SIS, respectively. Results are shown in five different tables. In Tables 2-6, the first column shows the name of the circuit. The label T (e.g. (3,3)-T and (3,4)-T) indicates that the mapping was limited to trees (fanout-free regions). The label D (e.g. (3,3)-D and (3,4)-D) indicates that the mapping was allowed to duplicate logic, resulting in DAG mapping. Table 2 shows the accumulated sum of series transistors in the pull-up and pull-down planes of each cell on the longest path of the circuit. It is noticeable that VIRMA-WF reduces the accumulated transistor chains along the longest path.
To verify that reducing the number of transistors in the pull-up and pull-down planes reduces circuit delay, we used SPICE simulation to estimate delay. The transistors in the SPICE description have a fixed size. Table 3 presents a delay comparison between SIS technology mapping and VIRMA-WF technology mapping. The second column (33-4 (ns)) shows delay values in nanoseconds for circuits mapped by SIS. Columns 3-7 show values normalized to the delay values of the second column. Our method provides better results than SIS, with average delay reductions of about 27% and 33% for virtual cell libraries restricted by the constraints (3,3) and (3,4), respectively. The technology mapping limited to tree (T) regions performed by our method also shows improvements of 13%-15% on average.
Table 4 shows an area comparison based on the number of transistors of each circuit. Due to the logic duplication inherent to DAG mapping, our method can increase the area. The area penalty of our technology mapping algorithm is, on average, 18% and 31% for the virtual cell libraries (3,3) and (3,4), respectively. When the technology mapping is limited to fanout-free regions, the average area increase is negligible. Although the delay gains are not maximized in these cases, a good area/delay trade-off is still achieved.
Table 5 shows the execution times for SIS and VIRMA-WF, in seconds. VIRMA-WF is more time consuming than SIS, and its execution time is not proportional to the size of the circuit. For instance, for the virtual cell library (3,4), the circuit c499 takes more time than c3540, although c499 is smaller than c3540. This is mainly due to the complexity of the lower bound computation for each match, and also to the number of matches generated during the technology mapping process.
[...] (3,3) and (3,4), when DAG mapping is applied. However, the area penalty is high. The c6288 is a multiplier composed of regular logic blocks, and it has several regions that are not fanout-free. Therefore, best matches that cross fanout points will probably also be best matches for other regions, resulting in many duplications. This area penalty can be reduced by allowing duplication of logic only in timing-critical regions.
Figure 6. The effect of varying the wave_width
The prototype implemented to obtain the experimental results is intended as a proof of concept. Our results show considerable delay gains. Nevertheless, the area results show that we have to look for better area/delay trade-offs. We expect to find them by allowing duplication only in critical regions of the circuit. This can also decrease the CPU time, since the number of matches will be reduced. Another possibility to reduce CPU time is to store precomputed lower bounds in a hash table, to avoid repeated computations.
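The hash table suggested above could take the following form. This is a sketch of a possible extension, not a feature of the current prototype: the class `BoundCache` is hypothetical, the key is simply the number of variables together with the on-set of the function, and `stand_in_bound` is a placeholder that only counts calls and does not compute a real bound.

```python
class BoundCache:
    """Hash table of lower bounds keyed by the function itself
    (number of variables and on-set), independent of circuit structure."""

    def __init__(self, compute):
        self._compute = compute  # the expensive lower bound routine
        self._table = {}
        self.misses = 0

    def lookup(self, onset, n):
        key = (n, frozenset(onset))
        if key not in self._table:
            self.misses += 1
            self._table[key] = self._compute(key[1], n)
        return self._table[key]


def stand_in_bound(onset, n):
    # Placeholder for the real lower bound routine; NOT a valid bound.
    return len(onset)


cache = BoundCache(stand_in_bound)
cache.lookup({(1, 1), (1, 0)}, 2)
cache.lookup({(1, 0), (1, 1)}, 2)  # same function, served from the table
print(cache.misses)
```

A practical key would probably also be made canonical under input permutation, so that matches computing the same function on differently ordered inputs share one entry.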
CONCLUSIONS
We have presented a library-less technology mapping algorithm to reduce delay in combinational circuits. A comparison between traditional technology mapping using SIS and our method using different virtual cell libraries shows delay reductions from 6% to 48%. For some circuits, better delay comes with a high area penalty. The VIRMA-WF technology mapping limited to fanout-free regions produces circuits with negligible area increase and with delay improvements of around 15% on average. In order to find a good trade-off between area and delay, the VIRMA-WF algorithm can be extended to use a mix of tree mapping on regions that are not timing critical and DAG mapping on timing-critical regions of the circuit, as suggested in the previous section.
The method presented here can be implemented in a non-disruptive way in existing design flows, if a cell/library generation tool is available. After mapping, the bespoke logic functions must be generated as cells to compose a library that is used for place & route and design closure of the mapped logic network. Future work will address this issue.
ACKNOWLEDGMENTS
This research was partially supported by CNPq/PNM and CAPES Brazilian Funding Agencies.
