In this paper we apply statistical sizing in an industrial setting. An efficient implementation of the statistical sizing algorithm is achieved by utilizing a dedicated interior-point solution method. The new solution method solves a robust linear program, mapped onto a second-order conic program, an order of magnitude faster than the previously explored formulation. Our sizing algorithm is unique in that it represents variability in circuit delay analytically by formulating a robust linear program. The algorithm allows efficient area minimization under statistically formulated timing yield constraints. We also report the first use of statistical gate sizing in an industrial microprocessor design flow as a post-synthesis optimization step. Statistical delay models were generated for a 90 nm CMOS standard cell library used in the design of an industrial low-power 32-bit x86 microprocessor, and practical issues related to iterative convergence were explored. Compared to deterministic sizing, the area savings are 26% for the microprocessor module. The runtime of the algorithm is very low compared to existing statistical sizing methods, achieving an almost 15X speed-up, and scales as $O(N^{1.5})$, where $N$ is the circuit size.
INTRODUCTION
Variability in process parameters leads to parametric yield loss due to timing and power constraints, and reducing this yield loss has become an important design need. This calls for new yield-centric statistical design methodologies. Statistical timing analysis that accounts for variability in circuit parameters [1] [2] [3] is not in itself a sufficient tool for minimizing the effect of variability on design. Deterministic optimization by definition lacks the explicit notions of parameter variance and parametric yield, preventing design for yield as an active design strategy. There have been several recent attempts to introduce statistical considerations into circuit optimization, and into sizing in particular [4] [5] [6] [7] [8]. However, all these methods suffer from high computational complexity, which appears to be a fundamental challenge in statistical optimization in general.
In this paper, we demonstrate for the first time the use of statistical circuit optimization in the design of an industrial microprocessor, using the approach developed in [9]. In [9], we developed a theoretical formulation of statistical circuit sizing via second-order conic programming (SOCP). Here, a large number of practical challenges of using this statistical sizing technique in an industrial setting are addressed. First, a specialized solver utilizing efficient interior-point methods for conic programs has been used to dramatically reduce the runtime and increase the capacity of the SOCP-based statistical sizing. Second, the problem of choosing the margin coefficients for gates has been explored more rigorously and a new scheme is proposed. Third, the practical issues of very accurate gate delay modeling and statistical process modeling are addressed. A comprehensive way of modeling delay variability that captures both intra- and inter-chip variability is proposed. The variability and delay models are generated from and validated by industrial technology files and transistor models. Finally, the sizing algorithm has been integrated into an industrial CAD flow handling sequential elements, fixed-size macros, and non-static logic elements. The flow was tested on a microprocessor block containing approximately 10K logic gates and 1K sequential cells. To our knowledge, this is the first attempt to apply statistical sizing in an industrial setting. We achieve area savings of 26% for that block, and savings of 8-30% across different benchmark cases, at the same frequency target without any parametric yield loss. We observe that the run-time scales as $O(N^{1.5})$, where $N$ is the number of gates in the circuit, which is close to the theoretically predicted behavior. We observe a 15X speed-up compared to existing statistical sizing approaches [4, 6]. For example, the optimization time for the microprocessor block is 24 minutes.
STATISTICAL MODELING AND OPTIMIZATION
The statistical equivalent of deterministic sizing can be formulated as the following chance-constrained problem:
Here the constraints are to be met with probability $\alpha$, the required yield level at the timing target $T$.
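As a point of reference, a chance-constrained sizing problem of this kind is commonly written in the generic form below. This is only a sketch: the symbols $s_i$ (gate sizes), $D_p(s)$ (delay of path $p$), and the size bounds are assumed here for illustration and are not necessarily the notation or the exact node-based form used in [9].

```latex
\begin{aligned}
\min_{s}\quad & \sum_i s_i \\
\text{s.t.}\quad & P\left(D_p(s) \le T\right) \ge \alpha \quad \text{for every path } p, \\
& s_{\min} \le s_i \le s_{\max} \quad \text{for every gate } i.
\end{aligned}
```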
To enable computational tractability, we adopt a piece-wise gate delay modeling method [10], extending it to a rigorously
where $\Delta L$, $\Delta V_{th}$ and $\Delta W$ are the random deviations of the parameters. Although the precise dependence of the sensitivities (i.e., the first derivatives of delay with respect to the parameters) on gate size is posynomial, the sensitivity coefficients of a gate must be represented as linear functions of the driver and load sizes, so that the dependence of the gate delay variance on the decision variables (gate sizes) can be captured in a second-order conic program. We use an empirically fitted linear model for this purpose. For example, the sensitivity of delay to gate length variation is empirically modeled as:
To evaluate the goodness of the fit, we performed a Monte Carlo simulation on a representative NAND2 gate in the library and compared the resulting delay distribution with that predicted by our model. The rms error was found to be ~5%.
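The kind of check described above can be sketched in a few lines. The example below is a minimal illustration, not the validation flow used in the paper: it assumes a first-order delay model with independent Gaussian parameter deviations, and the sensitivity values and standard deviations are hypothetical. It compares the closed-form delay standard deviation with a Monte Carlo estimate.

```python
import math
import random


def delay_sigma(sensitivities, sigmas):
    """Analytic std of d = d0 + sum_k a_k * dP_k for independent Gaussian dP_k."""
    return math.sqrt(sum((a * s) ** 2 for a, s in zip(sensitivities, sigmas)))


def delay_sigma_mc(sensitivities, sigmas, n=20000, seed=0):
    """Monte Carlo estimate of the same std, obtained by sampling the deviations."""
    rng = random.Random(seed)
    samples = [
        sum(a * rng.gauss(0.0, s) for a, s in zip(sensitivities, sigmas))
        for _ in range(n)
    ]
    mean = sum(samples) / n
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))


# Hypothetical sensitivities of delay to L, V_th and W (assumed values),
# and hypothetical standard deviations of the three parameter deviations.
A = [4.0, 2.5, -1.5]
SIGMAS = [0.05, 0.08, 0.05]

analytic = delay_sigma(A, SIGMAS)
sampled = delay_sigma_mc(A, SIGMAS)
relative_error = abs(analytic - sampled) / analytic
```

In a real flow the sampled delays would come from circuit simulation of the gate rather than from the linear model itself, so the relative error would also reflect the model's fitting error.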
The modeling framework must consistently handle different decompositions of variability into inter-chip and intra-chip components. This is accomplished here by adopting a linear additive model that decomposes the variability of all parameters into the intra-chip and chip-to-chip variability components. For example, for the effective channel length the model is:
$\Delta L = \Delta L_{inter} + \Delta L_{intra}$,

where $\Delta L_{inter}$ and $\Delta L_{intra}$ are the inter-chip and intra-chip components of variation. The gate delay covariance can be written down in terms of these inter-chip and intra-chip components. Note that this model is not based on knowledge of the specifics of the spatial correlation of intra-chip variability, in contrast to [1, 2]. Thus, the gate-to-gate correlation is assumed to come from the joint impact of intra- and inter-chip variability only. We believe this is reasonable because (a) data on spatial correlation is typically not available, (b) spatially correlated components are not numerically significant [11], and (c) it is possible to use an additive model to bound the impact of any remaining spatially correlated intra-chip variability.
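To make the role of the two components concrete, consider a single varying parameter $L$ and a first-order delay model. The sketch below is an illustration under stated assumptions, not the expression from the paper: the sensitivities $a_i$, the variances $\sigma_{L,inter}^2$ and $\sigma_{L,intra}^2$, and the independence of the intra-chip deviations between different gates are all assumed here.

```latex
% Assumed model: d_i = d_i^0 + a_i (\Delta L_{inter} + \Delta L_{intra,i}),
% with \Delta L_{intra,i} independent across gates and independent of \Delta L_{inter}.
\mathrm{Var}(d_i) = a_i^2 \left( \sigma_{L,\mathrm{inter}}^2 + \sigma_{L,\mathrm{intra}}^2 \right),
\qquad
\mathrm{Cov}(d_i, d_j) = a_i a_j \, \sigma_{L,\mathrm{inter}}^2 \quad (i \neq j).
```

Under these assumptions the shared inter-chip term is the only source of gate-to-gate covariance, while the intra-chip term adds to each gate's own variance and therefore lowers the correlation coefficient.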
The final statistical sizing formulation [9] is repeated here for convenience:

The above formulation, however, does not permit us to differentiate between the inter-chip and intra-chip components of variation. To accomplish this, we rewrite the percent point function of gate delay such that it is linear in the inter-chip component. This transformation is equivalent to assuming that all the paths in the circuit have the same sensitivity to the inter-chip variability of the process parameters. This can be justified because inter-chip variation inherently assumes perfect correlation between devices on a chip. For the sake of exposition, consider two sources of variation, namely $L$ and $W$. The percent point function of path delay is now given by:
Because the rewritten percent point function is greater than or equal to the original one, it ensures that the original constraint is satisfied.
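The conservatism of a constraint that is linear in the inter-chip component can be illustrated with a short worked bound. The symbols below ($\mu_p$, $\sigma_{p,inter}$, $\sigma_{p,intra}$ and the margin $k$) are assumed for illustration and are not the notation of [9]; the two components are taken to be independent.

```latex
\mu_p + k \sqrt{\sigma_{p,\mathrm{inter}}^2 + \sigma_{p,\mathrm{intra}}^2}
\;\le\;
\mu_p + k\,\sigma_{p,\mathrm{inter}} + k\,\sigma_{p,\mathrm{intra}},
\qquad k \ge 0,
```

since $\sqrt{x^2 + y^2} \le x + y$ for all $x, y \ge 0$. Any sizing that keeps the right-hand side below $T$ therefore also keeps the left-hand side below $T$.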
APPLICATION OF THE ALGORITHM IN THE INDUSTRIAL MICROPROCESSOR DESIGN FLOW
Incorporating statistical optimization into a practical design flow requires a modified development methodology. Figure 2 depicts the entire statistical sizing flow that we have implemented. At the preprocessing stage, the statistical gate delay models are generated and the combinational slices are extracted from the sequential circuit using the algorithm described in Figure 3 .
A key step in the implementation of the statistical sizing algorithm is the assignment of margin coefficients to all paths in the circuit such that the timing yield constraints are satisfied and the minimum area is obtained across all such assignments. Since it is difficult to choose the margin coefficients a priori such that the circuit meets timing when statistical timing analysis is run, we explored several solution strategies to determine a heuristic for path margin coefficient assignment. Our experiments indicate that the minimum-area solution is found with a uniform assignment of margin coefficients to all the paths. Therefore, we set all margin coefficients to a common value $k$. For very low values of $k$, the timing constraints are not satisfied when Monte Carlo analysis is performed. The minimum value of $k$ at which circuit timing becomes feasible is identified as the optimal margin coefficient assignment.
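The search for the smallest feasible uniform margin can be sketched as a simple sweep. The code below is a minimal illustration, not the implementation used in the flow: `min_feasible_margin` and `toy_meets_yield` are hypothetical names, and the feasibility check is a stand-in that assumes a single Gaussian path, whereas the real check sizes the circuit with margin $k$ and runs Monte Carlo timing analysis.

```python
from statistics import NormalDist


def min_feasible_margin(meets_yield, candidates):
    """Return the smallest candidate k for which meets_yield(k) is True, else None."""
    for k in sorted(candidates):
        if meets_yield(k):
            return k
    return None


def toy_meets_yield(k, alpha=0.999):
    """Stand-in for 'size with margin k, then check timing yield by Monte Carlo'.

    Assumes a single Gaussian path sized so that mean + k * sigma equals T,
    in which case its timing yield is the standard normal CDF evaluated at k.
    """
    return NormalDist().cdf(k) >= alpha


# Candidate margins k = 0.0, 0.1, ..., 4.0 (an assumed grid).
k_grid = [i / 10 for i in range(0, 41)]
k_star = min_feasible_margin(toy_meets_yield, k_grid)
```

A linear sweep is used because the number of candidate values is small; if feasibility is monotone in $k$, a bisection over the grid would need fewer sizing runs.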
Using the margin coefficients obtained from the above heuristic, the robust linear sizing problem is formulated. It is then converted into a second-order conic program and solved using a dedicated interior-point optimization package [12]. Timing is then checked: if the difference between the true quantile timing value and the target is greater than a pre-specified error, the target is adjusted and the optimization is re-run, until timing is met at the required percentile.
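The outer iteration described above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the adjustment rule (shifting the model target by the observed error) is one simple choice and not necessarily the rule used in the flow, and the optimizer and timing check are toy stand-ins for the SOCP solve and the statistical timing analysis.

```python
def size_to_quantile(optimize, quantile_delay, target, tol, max_iter=20):
    """Re-run sizing with an adjusted target until the true quantile delay meets it."""
    model_target = target
    for _ in range(max_iter):
        sizes = optimize(model_target)          # stand-in for the SOCP solve
        error = quantile_delay(sizes) - target  # true quantile vs. required target
        if abs(error) <= tol:
            return sizes, model_target
        model_target -= error                   # assumed adjustment rule
    raise RuntimeError("target adjustment did not converge")


# Toy stand-ins: the model is 10% optimistic about the true quantile delay.
def toy_optimize(t):
    return {"g": 1.0 / t}


def toy_quantile_delay(sizes):
    return 1.1 / sizes["g"]


sizes, final_target = size_to_quantile(toy_optimize, toy_quantile_delay, 1.0, 5e-3)
```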
Extending statistical optimization to an industrial design flow requires addressing several additional issues. The major ones are dealing with sequential elements and handling non-static logic. Even when microprocessor modules are largely implemented in static CMOS, they typically contain a number of gates implemented in non-static logic using pass transistors or transmission gates (e.g., multiplexers and XOR gates). These currently cannot be sized using an automated sizing algorithm. In addition, the feedback present in sequential elements may render the optimization problem infeasible.

Since our sizing algorithm is node-based, we approach the problem of sizing a sequential circuit by extracting the combinational slices from a structural post-synthesis Verilog file. All gates between two adjacent flop boundaries are treated as a single combinational slice. The flip-flop outputs are treated as the input nodes of the combinational block and the flip-flop inputs as its output nodes. The arrival time of signals at an input node is then given by $t_{setup} + t_{clk-Q}$ of the flop.

To handle non-static logic, we adopt a simple approach. Such cells are identified and assigned the $\alpha$-th percentile of delay, $D_\alpha$, corresponding to the size obtained from a deterministic path-based sizing heuristic, such as one based on logical effort, and a realistic fanout, which we take to be FO4. Here $\alpha$ is the desired parametric yield. The delay $D_\alpha$ can be obtained by performing a Monte Carlo simulation of the cell.
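Obtaining a percentile delay from Monte Carlo samples can be sketched as below. This is only an illustration: `toy_cell_delay` is a hypothetical linear stand-in for a circuit simulation of the fixed-size cell, and the nominal delay, sensitivities, and standard deviations are assumed values.

```python
import random


def quantile_delay(delay_fn, sigmas, alpha, n=20000, seed=1):
    """Empirical alpha-quantile of delay_fn over Gaussian parameter deviations."""
    rng = random.Random(seed)
    samples = sorted(
        delay_fn([rng.gauss(0.0, s) for s in sigmas]) for _ in range(n)
    )
    index = min(n - 1, int(alpha * n))  # clamp so alpha = 1.0 stays in range
    return samples[index]


def toy_cell_delay(deviations):
    """Hypothetical cell delay in ps: nominal value plus first-order terms."""
    nominal = 100.0
    sensitivities = [40.0, 25.0]  # assumed values for two varying parameters
    return nominal + sum(a * d for a, d in zip(sensitivities, deviations))


d_median = quantile_delay(toy_cell_delay, [0.05, 0.08], 0.5)
d_alpha = quantile_delay(toy_cell_delay, [0.05, 0.08], 0.999)
```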
Since we fix the size of the cell, we introduce a structural constraint to capture this. We also need to make sure that the fanout gates are not sized up to the point where the delay exceeds the $\alpha$-quantile delay $D_\alpha$. An additional constraint restricting the load on the gate is therefore also introduced. For example, if gate $j$ is a multiplexer, which is typically implemented using transmission gates, the additional constraints introduced are:
Here $s_j$ is the size of the gate, $s_k$, $k \in FO(j)$, are the sizes of its fanout gates, and $d_j$ is its $\alpha$-quantile delay.

The pseudo-code in Figure 3 describes the procedure for extracting the combinational slices from a sequential circuit. It consists of a path tracing routine that obtains the fanout cone of logic for every gate driven by a flop, and a levelizing routine that assigns a level to each flop in the circuit. We assign levels to the flops so that we can identify the gates that lie between any two flop boundaries. In this procedure, some gates may appear in several combinational slices if they lie on paths that fan out to flops at different levels. To avoid sizing these gates multiple times, the statistical sizing algorithm is applied starting with the level closest to the primary outputs. We keep track of gates that have already been sized, and at succeeding levels these gates are not sized again.
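The slice extraction and the "size each gate only once" bookkeeping can be sketched as follows. This is a minimal illustration and not the pseudo-code of Figure 3: the netlist representation (a fanout dictionary), the function names, and the example circuit are hypothetical, and the level ordering of the flops is taken as an input rather than computed.

```python
from collections import deque


def fanout_cone(fanout, flops, start_flop):
    """Breadth-first trace from a flop output, stopping at the next flop boundary."""
    gates, seen = [], set()
    queue = deque(fanout.get(start_flop, ()))
    while queue:
        node = queue.popleft()
        if node in flops or node in seen:
            continue  # stop at flop inputs; skip gates already visited
        seen.add(node)
        gates.append(node)
        queue.extend(fanout.get(node, ()))
    return gates


def plan_sizing(fanout, flops, flops_by_level):
    """Assign each gate to the first slice (in the given level order) containing it."""
    sized, plan = set(), []
    for flop in flops_by_level:
        new_gates = [g for g in fanout_cone(fanout, flops, flop) if g not in sized]
        sized.update(new_gates)
        plan.append((flop, new_gates))
    return plan


# Hypothetical netlist: f1 -> g1 -> g2 -> f2, f3 -> g2, and f2 -> g3 -> output.
FANOUT = {"f1": ["g1"], "g1": ["g2"], "g2": ["f2"],
          "f3": ["g2"], "f2": ["g3"], "g3": []}
FLOPS = {"f1", "f2", "f3"}

# Assumed order: flops closest to the primary outputs come first.
plan = plan_sizing(FANOUT, FLOPS, ["f2", "f3", "f1"])
```

Here gate `g2` lies in the cones of both `f1` and `f3`; it is assigned to the slice processed first and skipped afterwards.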
EXPERIMENTS AND RESULTS
The developed methodology was tested within an industrial microprocessor design flow for a low-power 32-bit x86 processor. A proprietary standard cell library targeted for a 90 nm bulk CMOS process was used. The library contains 22 cells, with more than 15 drive strengths; we used a subset of this library. The technology was characterized statistically with respect to three parameters: effective channel length ($L_{eff}$), minimum transistor width ($W$), and threshold voltage ($V_{th}$). The variation was found to be 5% for $L_{eff}$ and $W$, and 8% for $V_{th}$ in terms of $\sigma/\mu$ values. Statistical delay models were generated by Monte Carlo simulation using the circuit simulator HSPICE. The SOCP algorithm was implemented using the commercially available conic solver MOSEK [12]. The methodology was applied to the design of a large industrial microprocessor module containing nearly 10000 gates and 1000 flops, as well as a variety of smaller public (ISCAS'85) benchmark circuits. The algorithm was run on a dual-core 2 GHz AMD Opteron machine with 4 GB of RAM.

Figure 3: Pseudo-code for extracting combinational slices (construct the hash array; breadth-first search; backward traversal of the circuit graph to levelize the flops).
From Table 1, for the processor block, the reduction in area enabled by statistical sizing is 26%. $T_{target}$ corresponds to the minimum delay through the circuit obtained by unconstrained optimization in the deterministic setting, where the varying parameters are set to their worst-case values, and $A_{det}$ is the corresponding area. $A_{99.9\%}$ is the area obtained by the statistical sizing algorithm for the timing constraint equal to $T_{target}$ at the 99.9% yield level. An equal breakdown into intra- and inter-chip components is assumed for the correlated case.
The application of the statistical flow to the microprocessor block was studied in depth with respect to different decompositions of process variability. Figure 4 shows the area-delay Pareto curves for different structures of variability. We experimented with three different breakdowns. The area improvements are larger for higher ratios of intra-chip variability: as the intra-chip component increases, the sizing algorithm is able to find a configuration with smaller area for the same delay target.

Table 1 and Figure 5 also show the run-time behavior of our algorithm. We observe that the run-time grows as $O(N^{1.5})$, where $N$ is the circuit size. The run-time to optimize the microprocessor block of 10K gates is reasonable, close to 24 minutes. The run-time for the largest benchmark circuit is on the order of several minutes (197 s). This is more than 15 times faster than the approaches presented in [6, 10], which is the biggest advantage of the proposed algorithm. At the same time, the area savings are also very good, as we achieve greater savings in area compared to [6].
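A scaling exponent of this kind is typically estimated as the slope of a log-log fit of run-time against circuit size. The sketch below illustrates the calculation only; the data points are synthetic values generated from an assumed $t = c N^{1.5}$ law and are not measurements from Table 1 or Figure 5.

```python
import math


def fit_scaling_exponent(sizes, runtimes):
    """Least-squares slope of log(runtime) versus log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


# Synthetic data generated from t = c * N**1.5 (not measurements from the paper).
sizes = [200, 1000, 5000, 10000]
runtimes = [1e-3 * n ** 1.5 for n in sizes]
exponent = fit_scaling_exponent(sizes, runtimes)
```

With an exponent of 1.5, doubling the circuit size multiplies the run-time by $2^{1.5} \approx 2.83$.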
CONCLUSION
In this paper, we report the first application of large-scale statistical design optimization in an industrial microprocessor design flow. The statistical sizing is based on a very fast implementation and results in substantial area savings.
