Abstract-The Lagrangian relaxation (LR) based gate sizer proposed in [1] has the best leakage power results published so far for the ISPD 2012 Gate Sizing Contest benchmarks. However, it requires many LR iterations and does not rely on any technique to perform cell option candidate filtering in the LR subproblem solver. Therefore, this paper presents some extensions to address these drawbacks. In order to reduce the number of LR iterations, we propose some enhancements to the original LR multiplier formula. We also use a scaling factor to properly scale timing cost and leakage power in the LR local cost. Moreover, we apply a cell option candidate filtering strategy to reduce the runtime of each LR iteration. Finally, we improve the post-processing timing recovery and power recovery. Our work achieved leakage power results very close to the original algorithm, taking 4.28× fewer LR iterations, on average, and 9.11× fewer cell swaps during LR, on average.
I. INTRODUCTION
As the manufacturing technology of digital integrated circuits advances, industry faces new challenges. The shrinking of devices and the continuous growth in the number of transistors per chip make leakage power reduction an unavoidable concern. Compared to planar technology, FinFET improved the leakage power [2]; however, it has not eliminated this power component. Therefore, optimization techniques addressing power minimization still play a major role.
In standard cell based designs, the gate-level netlist is mapped to a technology library, which contains different implementation options for each available logic function. Considering a particular logic function, the versions differ in size and threshold voltage, hence, they have different timing and leakage power characteristics.
In order to optimize the leakage power, the discrete gate sizing technique may be employed in the design flow. It consists of assigning to each gate of the netlist an appropriate cell option such that the specified timing is met and some metrics, e.g. leakage power and area, are optimized. Discrete gate sizing is a combinatorial problem that was proved to be NP-hard in [3]; thus, efficient heuristic algorithms are fundamental to address it.
Many approaches have been proposed to tackle the gate sizing problem, such as linear programming [4], dynamic programming [5], geometric programming [6], sensitivity guided metaheuristic [7], and Lagrangian relaxation [5], [8], [9], [10], [11], [12], [13]. State-of-the-art works rely on LR due to its effectiveness in producing good solutions. However, LR based sizers can take many LR iterations to produce a good coarse-grained solution in terms of leakage power and timing, e.g. [1], which needs 100 iterations. [13] addressed this issue by proposing a Lagrange multiplier update strategy that accelerates the convergence of the algorithm. Besides the large number of iterations, the evaluation of all cell options to compute the lowest LR local cost can be very slow, especially if a complex timing model is used. Furthermore, in the initial LR iterations, the leakage power tends to grow significantly, e.g. to 15× the final value [1], so several iterations are required to reduce this power increase. Considering that a design can have a large number of gates, a coarse-grained LR based sizer can be very time consuming if these drawbacks are not properly handled.
Thus, in this work, we extended the gate sizing tool proposed in [1] to cope with the drawbacks of the LR. As in [13], we divide the LR phase into timing recovery LR (TR-LR) and power reduction LR (PR-LR). The TR-LR is focused on fixing timing violations, while the main goal of the PR-LR is to optimize the leakage power of the solution. By dividing the LR into two stages, we can apply cell option candidate filtering in the PR-LR without significant impact on the leakage power results. Also, we extended the Lagrange multiplier update formula to accelerate both TR-LR and PR-LR. Furthermore, we use the average delay per leakage factor, which properly scales timing and leakage power in the LR local cost. Thus, the extended multiplier formula along with the average delay per leakage factor are able to control the leakage power blow up during the TR-LR phase. In addition, we modified the timing recovery post-processing phase to reduce its runtime. Finally, we propose the arrival time sensitivity metric to classify cells that are good candidates to be changed in the post-processing power recovery.
The rest of this paper is organized as follows. Section II provides several definitions used in this paper. Section III presents the overall flow of the discrete gate sizing tool proposed in this work. The strategies adopted to improve the LR core of the sizer are discussed in Section IV. Sections V and VI present the enhanced timing recovery and the sensitivity-based power recovery, respectively. The experimental results are shown in Section VII. Section VIII concludes this work.
II. DEFINITIONS
Table I presents the definition of several terms used in this work.
As in [1], our gate sizer has 4 main phases: initialization to precondition the solution, an LR based sizer, timing fixing and power reduction.
In the initialization phase, the algorithm starts by assigning each gate to the lowest leakage cell option. Next, load and slew violations are removed using the method presented in [8] (with α = 0.7). Hence, a solution which has only timing violations is passed to the Lagrangian relaxation phase. Before starting the LR, the Lagrange multipliers of each timing arc are initialized to 1.
After performing the preconditioning of the solution, the LR phase comes into play. The timing recovery LR quickly reduces the timing violations. Next, the power reduction LR performs the leakage power optimization.
If any timing violation is left after the LR terminates, a post-processing timing fixing algorithm is executed. Finally, in order to reduce the leakage power even further, an algorithm based on arrival time sensitivity is executed. Figure 1 depicts the overall flow.
IV. IMPROVEMENTS IN THE LR CORE OF THE BASELINE WORK
The main optimization phase of our flow is based on the Lagrangian relaxation, as in the baseline work [1]. Although the baseline work [1] has the best leakage power results published so far for the set of benchmarks of the ISPD 2012 Contest on Discrete Gate Sizing [14], it presents some drawbacks in its LR core. The first is that the algorithm requires a large number of iterations (100) to converge to a good solution. A portion of these iterations is used to reduce the huge increase in leakage power (e.g. 15× the final value) that occurs in the first iterations of the LR. Also, the algorithm does not rely on any cell option candidate filtering strategy; thus, computing the lowest LR local cost can be a slow process. Therefore, this section presents a set of strategies used to tackle these drawbacks. The objective of the proposed strategies is to reduce the number of iterations required by the LR core and to dramatically reduce the number of cell options evaluated during the LR phase.
A. Updated Lagrangian Relaxation Formulation
For each gate of the circuit, the Lagrangian relaxation subproblem solver of the baseline work greedily selects the cell option that locally minimizes the sum of timing cost and leakage power cost. However, these two quantities are of different natures and therefore must be scaled. A consequence of summing them without proper scaling is the huge increase in leakage power during the initial LR iterations. This impacts the runtime, since part of the iterations is spent reducing the leakage peak. The LR does slowly adjust the Lagrange multipliers so that, eventually, a scaling factor would not be necessary; however, waiting for this adjustment wastes runtime.
Initially, the solution has excessive timing violations; therefore, the LR local cost is highly dominated by the timing cost. By scaling timing and power costs, the importance of the leakage power is increased, so the LR core tends to be less aggressive in the initial iterations, reducing the excessive increase in leakage power. In this work, the scaling is performed by multiplying the leakage power cost by a library-dependent factor α called average delay per leakage, whose computation is detailed later in this section. Thus, by defining α in units of delay/leakage, only quantities with the same unit are summed up in the cost function, as shown below:
Therefore, the LR formulation used in this work is defined as:
$$\text{LRS:} \quad \text{minimize} \;\; \alpha \times \text{leakage} + \sum_{i \to j} \lambda_{i \to j} \, \text{delay}_{i \to j} \tag{2}$$
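The effect of the scaling factor on the greedy choice can be illustrated with a minimal sketch. It is not the baseline implementation: the `CellOption` type, the arc naming and the numeric values are hypothetical, and the cost is simplified to the arcs of the gate itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CellOption:
    """Hypothetical container for one implementation option of a gate."""
    name: str
    leakage: float                # leakage power of this option
    arc_delays: Dict[str, float]  # delay of each timing arc, keyed by arc name


def local_cost(option: CellOption, multipliers: Dict[str, float], alpha: float) -> float:
    """Scaled local cost: alpha * leakage + sum of lambda * delay over the arcs."""
    timing_cost = sum(multipliers[arc] * delay
                      for arc, delay in option.arc_delays.items())
    return alpha * option.leakage + timing_cost


def pick_option(options: List[CellOption], multipliers: Dict[str, float],
                alpha: float) -> CellOption:
    """Greedy choice: the option with the lowest scaled local cost."""
    return min(options, key=lambda option: local_cost(option, multipliers, alpha))


small = CellOption("small", leakage=1.0, arc_delays={"a->z": 10.0})
big = CellOption("big", leakage=5.0, arc_delays={"a->z": 4.0})
lam = {"a->z": 1.0}
```

With these made-up numbers, `alpha = 1.0` selects the larger, leakier option, while `alpha = 2.0` keeps the smaller one, which is the kind of damping of the initial leakage increase that the scaling is meant to provide.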
B. Splitting of the Lagrangian Relaxation Core
As in [13], the LR phase is divided into two main stages: timing recovery Lagrangian relaxation (TR-LR) and power reduction Lagrangian relaxation (PR-LR). During the TR-LR, the optimization is focused on fixing most of the timing violations. In order to quickly reduce the number of endpoints with negative slack, the Lagrange multiplier formula is adjusted to emphasize delay over leakage power. The PR-LR starts as soon as the TR-LR terminates. This stage is focused on power optimization. Thus, the Lagrange multiplier formula is adjusted to enable a quick power reduction without harming the timing of the circuit too much.
The LDP solver for the TR-LR is presented in Algorithm 1. It starts by setting each Lagrange multiplier to 1. As stated before, the TR-LR is focused on quickly removing most of the timing violations. Hence, at each iteration, the Lagrange multipliers are updated such that a rapid convergence is obtained and the leakage power does not dramatically blow up. The Lagrange multipliers update formula used is similar to the one used in the baseline work. The only difference is that an exponent k is used. Therefore, the formula becomes:
The values for the exponent k were obtained empirically in several experiments performed using the benchmarks of the ISPD 2012 Contest on Discrete Gate Sizing [14]. The exponent k was set to 3 for critical timing arcs in order to emphasize delay over leakage power. For critical cells that have most of the critical paths passing through them, referred to here as bottleneck cells, k was set to 10. It was empirically established that a cell is a bottleneck if its centrality, an estimate of how many and how critical the endpoints affected by the cell are [15], is greater than or equal to 0.9. By handling bottleneck cells in a special manner, several critical paths are expected to be fixed quickly. This exponent distribution, alongside the α scaling factor discussed earlier, mitigates the leakage power blow up during the TR-LR stage, since the effort of the LR is more aggressive on the critical cells, especially on the bottleneck ones. Table II summarizes the exponent values used during the TR-LR stage.
The Lagrangian relaxation subproblem is solved using the same algorithm presented in [1], adapted to consider the updated LRS formulation presented in Equation 3. Therefore, for each gate, the algorithm evaluates all its cell options and selects the version that locally minimizes the sum of the timing cost and the leakage cost multiplied by the scaling factor. The TR-LR execution finishes when the TNS is less than 10% of the clock period, as in [13], or when the number of iterations exceeds 50.
The LDP solver used in the PR-LR is presented in Algorithm 2. In each iteration of the LDP solver, the Lagrange multipliers are updated using formula 3 presented earlier. However, a different exponent distribution is used, so that the leakage power is quickly reduced. In order to achieve a rapid convergence without harming the timing of the design too much, the values presented in Table III are used. Like the exponent distribution used in TR-LR, the exponent values used during PR-LR were found empirically. As in TR-LR, the Lagrangian relaxation subproblem is solved using the same algorithm presented in [1], adapted to consider the formulation presented in Equation 3. However, a cell option candidate filtering strategy, discussed next, is applied to reduce the set of cell options evaluated for each gate.
In each iteration of the LDP solver, the global solution found is considered only if it has smaller leakage and if its TNS is less than 10% of the clock period. The execution of the PR-LR finishes when the improvement in the leakage power is less than 0.1%.
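The control rules of the two LR stages described in this subsection can be summarized in a short sketch. The function names are hypothetical, the TNS is assumed to be given as a non-negative magnitude, the bottleneck test is assumed to take precedence over the critical-arc test, and the exponent for non-critical arcs is left as a caller-supplied parameter because it is not stated in the text.

```python
BOTTLENECK_CENTRALITY = 0.9  # centrality threshold for bottleneck cells
K_CRITICAL = 3               # TR-LR exponent for critical timing arcs
K_BOTTLENECK = 10            # TR-LR exponent for bottleneck cells


def tr_lr_exponent(is_critical: bool, centrality: float, k_non_critical: float) -> float:
    """Pick the TR-LR exponent k for a timing arc of a cell."""
    if centrality >= BOTTLENECK_CENTRALITY:
        return K_BOTTLENECK
    if is_critical:
        return K_CRITICAL
    return k_non_critical  # value not given in the text, supplied by the caller


def tr_lr_done(tns: float, clock_period: float, iteration: int,
               max_iterations: int = 50) -> bool:
    """TR-LR stops when TNS < 10% of the clock period or after 50 iterations."""
    return tns < 0.10 * clock_period or iteration > max_iterations


def pr_lr_accept(leakage: float, tns: float, best_leakage: float,
                 clock_period: float) -> bool:
    """PR-LR keeps a solution only if it leaks less and TNS < 10% of the period."""
    return leakage < best_leakage and tns < 0.10 * clock_period


def pr_lr_done(previous_leakage: float, current_leakage: float,
               min_improvement: float = 0.001) -> bool:
    """PR-LR stops when the relative leakage improvement drops below 0.1%."""
    improvement = (previous_leakage - current_leakage) / previous_leakage
    return improvement < min_improvement
```

Keeping the thresholds as named constants and default arguments makes it easy to repeat the kind of empirical tuning the text mentions.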
C. Cell Option Candidate Filtering
Experiments performed in this work showed that it is sufficient to evaluate only the neighboring options of the current cell option: the next two bigger sizes and the next two smaller sizes, varying the Vth among LVT, SVT and HVT for each size, and also the options with the same size but a different Vth. This implies that the number of candidates evaluated depends on the current implementation option; for example, a gate already at the smallest size has no smaller neighbors.
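The neighborhood described above can be sketched as follows, assuming that the options of a logic function form a grid of size indices by threshold voltages. The function name, the index-based representation and the choice to exclude the current option from the list are assumptions made for illustration.

```python
from typing import List, Tuple

VTHS = ("LVT", "SVT", "HVT")


def neighbor_candidates(size_index: int, vth: str, num_sizes: int,
                        reach: int = 2) -> List[Tuple[int, str]]:
    """Options within `reach` sizes of the current one, for every Vth.

    The current option itself is left out; a caller can always keep it.
    """
    lowest = max(0, size_index - reach)
    highest = min(num_sizes - 1, size_index + reach)
    return [(size, v)
            for size in range(lowest, highest + 1)
            for v in VTHS
            if (size, v) != (size_index, vth)]
```

For a hypothetical library with 10 sizes, a gate in the middle of the size range gets 14 candidates, while a gate at the smallest size gets only 8, which shows why the number of evaluations depends on the current option.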
D. Average Delay per Leakage Scaling Factor Calculation
For each logic function available in the cell library, an average delay per leakage factor is calculated based on the information provided by the library. For each cell option of a given logic function f, the delay of its timing arcs is computed assuming an output load four times greater than its input load. Considering a timing transition of a timing arc, the delay is obtained by varying the input slew until it is approximately equal to the output slew. When this condition is met, the algorithm accumulates the ratio delay/leakage. After all cell options have been evaluated, the average delay per leakage of f is computed by dividing the accumulated delay/leakage ratio by the number of delay evaluations performed.
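A minimal sketch of this computation is given below. How the input slew is varied is not specified in the text, so a simple fixed-point iteration is assumed here; the `model` callback, the tuple layout of an option and `toy_model` are hypothetical stand-ins for a real library lookup.

```python
from typing import Callable, Iterable, Sequence, Tuple

# (delay, output_slew) = model(option_name, arc, input_slew, output_load)
DelayModel = Callable[[str, str, float, float], Tuple[float, float]]
# (name, leakage, input_load, timing arcs / transitions)
Option = Tuple[str, float, float, Sequence[str]]


def average_delay_per_leakage(options: Iterable[Option], model: DelayModel,
                              start_slew: float = 0.0, tol: float = 1e-6,
                              max_steps: int = 200) -> float:
    ratio_sum, evaluations = 0.0, 0
    for name, leakage, input_load, arcs in options:
        load = 4.0 * input_load  # output load four times the input load
        for arc in arcs:
            slew = start_slew
            # Assumed strategy: feed the output slew back as the input slew
            # until both are approximately equal (or max_steps is reached).
            for _ in range(max_steps):
                delay, out_slew = model(name, arc, slew, load)
                if abs(out_slew - slew) <= tol:
                    break
                slew = out_slew
            ratio_sum += delay / leakage
            evaluations += 1
    if evaluations == 0:
        raise ValueError("no timing arcs to evaluate")
    return ratio_sum / evaluations


def toy_model(name: str, arc: str, input_slew: float, load: float) -> Tuple[float, float]:
    """Made-up linear delay and slew model, used only to exercise the sketch."""
    return 2.0 + 0.1 * input_slew + 0.5 * load, 0.5 * input_slew + 10.0
```

With `toy_model`, the slew settles near 20 and the delay near 6, so two options with leakage 3 and 6 and the same input load yield ratios close to 2 and 1, and an average close to 1.5.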
V. ENHANCED TIMING RECOVERY
We modified the timing recovery (TR) method proposed in the baseline work to improve the runtime. The new approach is divided into 5 steps.
In the first step, the bottleneck cells are processed: all cells are sorted in decreasing order of centrality. The cells are sorted again whenever 4% of the total number of cells have been visited, a value established empirically. This avoids re-sorting the cells every time a cell change improves the TNS, which would harm the runtime. The algorithm tries to swap each cell to the option with the next bigger size, keeping the Vth, so that the leakage does not increase too much. In order to correctly evaluate the change in timing, an incremental timing update is performed. The new cell option is accepted only if the TNS is not degraded and no slew or load violations are generated; otherwise, the change is undone.
When only a few endpoints with timing violations are left, we observed that sorting the cells in decreasing order of criticality is more effective in terms of convergence. Thus, the second step starts when at most 10 endpoints must be fixed to meet timing. In this step, the cells are sorted only once in decreasing order of criticality, $\min\{0, \text{slack}\} / \text{WNS}$. Each visited cell is handled in the same way as before, that is, the algorithm tries to upsize it.
The third step comes into play when only one endpoint with a timing violation is left. For this endpoint, only the cells in its critical path are processed. In the fourth step, if the timing is still not met, the algorithm of the second step is called once more. Finally, if this flow fails, the timing recovery from [1] is executed to fix the remaining timing violations, although this has not occurred in our experiments.
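The periodic re-sorting of the first step can be sketched as a visiting loop. This is one possible reading of the description, assuming that each cell is visited once per pass; `centrality` and `try_upsize` are hypothetical callbacks, the latter standing for the swap to the next bigger size followed by the incremental timing update and the accept or undo decision.

```python
import math
from typing import Callable, Hashable, Iterable, List


def timing_recovery_first_step(cells: Iterable[Hashable],
                               centrality: Callable[[Hashable], float],
                               try_upsize: Callable[[Hashable], bool],
                               resort_fraction: float = 0.04) -> List[Hashable]:
    """Visit cells by decreasing centrality, re-sorting after each batch.

    Returns the visiting order so that the effect of re-sorting can be inspected.
    """
    pending = list(cells)
    # Number of cells visited between two sorts (4% of all cells by default).
    batch = max(1, math.ceil(resort_fraction * len(pending)))
    visited = []
    while pending:
        pending.sort(key=centrality, reverse=True)
        current, pending = pending[:batch], pending[batch:]
        for cell in current:
            try_upsize(cell)  # assumed to undo the swap itself when rejected
            visited.append(cell)
    return visited
```

Sorting once per batch instead of once per accepted change trades some ordering accuracy for fewer sorts, which is the runtime argument made in the text.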
VI. SENSITIVITY-BASED POWER RECOVERY
We extended the post-processing power recovery (PR) method presented in [1] . In our new approach, the cells are sorted in increasing order of arrival time sensitivity, a metric that reflects how much a cell is likely to affect the timing given a change in the arrival time at its output pins.
For each visited cell, the algorithm tries to increase the Vth as much as possible and, after that, decreases its size in the same way. Every change is accepted only if slew, load and timing violations are not created.
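The visiting order and the two greedy passes per cell can be sketched as below. The `sensitivity` and `is_legal` callbacks are hypothetical: the first stands for the arrival time sensitivity metric and the second for the slew, load and timing checks, both of which would come from a timing engine in practice.

```python
from typing import Callable, Dict, Hashable, Iterable, Tuple

VTH_ORDER = ("LVT", "SVT", "HVT")  # from lowest to highest threshold voltage


def recover_power(cells: Iterable[Hashable],
                  sensitivity: Callable[[Hashable], float],
                  assignment: Dict[Hashable, Tuple[int, str]],
                  is_legal: Callable[[Hashable, int, str], bool]
                  ) -> Dict[Hashable, Tuple[int, str]]:
    """Greedy power recovery; `assignment` maps cell -> (size index, Vth)."""
    for cell in sorted(cells, key=sensitivity):  # increasing sensitivity
        size, vth = assignment[cell]
        level = VTH_ORDER.index(vth)
        # Raise the threshold voltage while no violation is created.
        while level + 1 < len(VTH_ORDER) and is_legal(cell, size, VTH_ORDER[level + 1]):
            level += 1
            assignment[cell] = (size, VTH_ORDER[level])
        # Then reduce the size while no violation is created.
        while size > 0 and is_legal(cell, size - 1, VTH_ORDER[level]):
            size -= 1
            assignment[cell] = (size, VTH_ORDER[level])
    return assignment
```

The assignment is updated after every accepted change so that later legality checks see the current state of the circuit.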
The timing cost is defined as the sum of the squares of the arrival times at the endpoints:
$$\text{Timing} = \sum_{e \in \text{endpoints}} a_e^2$$
where $a_e$ denotes the arrival time at endpoint $e$. For instance, the timing sensitivity at pin 1 in Figure 2 can be expressed using the chain rule as in Equation 4.
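Because the timing cost is a sum of squares, its derivative with respect to a single endpoint arrival time has a simple closed form, and a generic chain-rule expansion toward an internal pin follows from it. The symbols $a_e$ (arrival time at endpoint $e$), $a_p$ (arrival time at pin $p$) and $E$ (set of endpoints) are notation assumed here for illustration; this is a generic identity, not a reproduction of Equation 4.

```latex
\frac{\partial\,\text{Timing}}{\partial a_e}
  = \frac{\partial}{\partial a_e} \sum_{e' \in E} a_{e'}^2
  = 2\,a_e,
\qquad
\frac{\partial\,\text{Timing}}{\partial a_p}
  = \sum_{e \in E} 2\,a_e \,\frac{\partial a_e}{\partial a_p}
```

Under this form, endpoints with larger arrival times weigh more in the sensitivity of any pin that influences them.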

VII. EXPERIMENTAL RESULTS
The proposed flow was implemented in C++11 over the Rsyn infrastructure [16] and was evaluated using the ISPD 2012 Discrete Gate Sizing Contest benchmark suite [14]. The experiments were carried out on a machine with four Intel(R) Core(TM) i7-6700 @ 3.4GHz CPUs and 32GB of memory.
The leakage power results obtained are shown in Table IV. The results are compared with the baseline work [1] and with the RGS [13], since the latter is the fastest gate sizing tool found in the literature so far. The results show that the new gate sizer performs only 1.1% worse in leakage power, on average, than the baseline sizer, while it performs 0.4% better, on average, than [13]. Also, for the netcard_fast circuit, which is the largest circuit of the benchmark suite, the sizer developed in this work marginally outperforms the results obtained in [1] and [13], with leakage power savings of 0.27% and 0.54%, respectively.
The runtime and the runtime breakdown are shown in Table V. For some fast benchmarks, the new gate sizer spends most of the runtime in the timing recovery phase. This is expected, since, unlike the fast ones, the slow benchmarks do not require much effort to remove the timing violations. Although the sizer developed in this work was implemented on an infrastructure different from that of [1] and [13], Table VI presents the runtime comparison between our work and the other two works. It is important to highlight that a fair comparison would require the three works to be implemented using the same framework; the runtimes are reported only for completeness. It can be noticed that, for the vga_lcd_fast benchmark, the new sizer is much slower than the other works, being 19.50× slower than [1] and 291.51× slower than [13]. This can be explained by the runtime breakdown in Table V, which shows that, for this benchmark, the timing recovery phase corresponds to 84.15% of the total runtime.
Since the RGS does not rely on a post-LR power recovery phase, Table VII compares the RGS with the new gate sizer without the enhanced power recovery step.
The results show that, in general, the new sizer produces very close leakage power results, being only 2% worse, on average. The RGS performs 13.4% better for the leon3mp_fast benchmark, since the post-LR power recovery phase of the new gate sizer improves the leakage power significantly for this benchmark. Also, without the power recovery phase, the new sizer is still slower than the RGS, but, as shown in the table below, it is now 34.94× slower, on average, while with the power recovery phase it is 66.33× slower, on average. So, removing the power recovery reduces the average slowdown by 47.32%.
Table VIII presents the true speedup obtained in this work. In this table, the new sizer is compared with a version of the baseline sizer implemented on Rsyn, which is referred to in the table as Baseline Rsyn. In order to make a fair comparison, we chose benchmarks for which the leakage power results obtained by Baseline Rsyn match or are very close to the results reported in [1]. The results show that the new sizer is 9.77× faster, on average, than the baseline sizer implemented on Rsyn.
Finally, Table IX presents the real gain obtained in our work. The new gate sizing tool requires 4.28× fewer LR iterations, on average, than the baseline work. Also, during the LR phase, it evaluates 9.11× fewer cell option candidates, on average, than the baseline sizer. A significant reduction in the leakage power blow up during the LR phase is also achieved. Compared to the final leakage power values presented in Table IV, the baseline gate sizing tool increases the leakage power 9.65×, on average, while the new tool only increases the leakage power 2.74×, on average. A large reduction in the number of LR iterations was obtained for the netcard_slow benchmark. While the baseline work requires 100 iterations during the LR, this work required only 10 iterations. Thus, in terms of iterations, the speedup obtained was 10×.
As a consequence, the new sizer evaluated 22.66× fewer cell option candidates than the baseline sizer, which is a significant reduction.
Considering the vga_lcd_slow benchmark, the leakage power peak during the LR phase of the baseline sizer reaches 15.35× the final leakage power value, while the leakage power peak of the new sizer is 2.66× the final value. This result shows the effectiveness of the strategies adopted to reduce the leakage power blow up during the initial LR iterations.
By comparing the leakage power results from Table IV with the improvements presented in Table IX, it is possible to notice that the sizer developed in this work was able to tackle the drawbacks presented earlier in this paper without harming the leakage power too much, since it is only 1.1% worse, on average, than the results obtained by [1].
VIII. CONCLUSION
We presented a set of strategies to cope with some drawbacks of LR based gate sizing algorithms, using the gate sizer proposed in [1] as our baseline. Compared to [1], the new approach produces similar leakage power results while performing 4.28× fewer LR iterations, on average, and 9.11× fewer cell swaps during LR, on average. Also, we managed to reduce the leakage power blow up in the LR phase from 9.65× the final value, on average, to 2.74× the final value, on average. Furthermore, our gate sizer without the post-processing power recovery step still produces leakage power results similar to those of [13], performing slightly better for some benchmarks. In addition, we modified the timing recovery method to improve the runtime, since this work was implemented over an infrastructure that is not optimized for gate sizing. Finally, we proposed the arrival time sensitivity metric to identify cells that are good candidates to be processed during the power recovery phase.
