Abstract: Algorithms and tools used for IC implementation do not show deterministic and predictable behaviors with input parameter changes. Due to suboptimality and inaccuracy of underlying heuristics and models in EDA tools, overdesign using tighter constraints does not always result in better final design quality. Moreover, negligibly small input parameter changes can result in substantially different design outcomes.
I. INTRODUCTION
With the rapid scaling of design complexity and quality requirements for system-on-chip products, electronic design automation (EDA) tools are confronted with ever-increasing problem difficulty and instance size. Virtually all underlying problem formulations within the IC implementation flow are NP-hard, and only heuristics, typically with unknown suboptimality, are usable in practice. In addition, due to the shrinking scale and large variability of nanometer process technologies, the modeling and analysis of physical problems are often fraught with accuracy challenges.
Commercial EDA tools and methodologies are used in production design because they improve turnaround times and design productivity. However, due to the suboptimality of underlying heuristics and the inaccuracy of underlying models and analyses, designers typically expect to spend significant time and effort after the tool flow is completed, to analyze remaining problems and fix them manually. To trade off quality of results versus implementation time and effort, designers must be able to predict the output quality of heuristic tools and methodologies. Based on their predictions, designers may target different objectives (e.g., minimizing area rather than performance), or different values of objectives (e.g., higher clock frequency, or lower leakage power).
With the above in mind, "predictability" is one of the most important attributes of IC design automation algorithms, tools and methodologies [1]. However, EDA heuristic approaches do not always behave according to users' intentions. Yakowitz et al. [5] study random search in machine learning under noise, and Kushner [4] analyzes the effect of noise in machine learning as stochastic approximation. Hartoog [3] observes the existence of noise in a VLSI algorithm and proposes exploitation of the noise to produce various benchmark circuits. At the first Physical Design Symposium in 1997, Ward Vercruysse of Sun Microsystems referred to the back-end implementation flow as a "chaos machine". The implication is that a very small change in inputs could lead to a very large change in outputs. In 2001, Kahng and Mantik [2] examined 'inherent noise' in IC implementation tools, i.e., how equivalent inputs could lead to different outputs.
In the present work, we assess the nature of 'chaotic' behavior in IC implementation tools, with an aim to establishing a beneficial methodology from such chaos. To assess the chaotic behavior, we experimentally determine answers to four questions:
• How strongly correlated are post-synthesis netlist quality and post-routing design quality?
• How strongly correlated are design quality as evaluated by vendor place-and-route tools and design quality as evaluated by signoff tools?
• What chaotic behavior is associated with input parameters of vendor synthesis tools?
• What chaotic behavior is associated with input parameters of vendor place-and-route tools?
Our experimental analyses confirm substantial "chaos" in vendor tools, e.g., worst timing slack can vary by up to hundreds of picoseconds, and area can vary by up to 16.4%, in a given (65nm) block implementation. Furthermore, there is little correlation between major design stages.
Based on our experimental results, we establish a methodology to exploit the unavoidable noise and chaos in backend optimization tools. Our methodology is based on 'multi-run' and/or 'multi-start' execution of the design flow with small, intentional input parameter perturbations. This recalls the work of [2], which also proposed the exploitation of 'inherent tool noise' to achieve more predictable and stable tool outcomes. A key difference between [2] and our work is that the 'noise' sources in [2] (renaming cell instances, perturbing the design hierarchy, etc.) are not practically usable. On the other hand, our proposed 'chaos' levers, such as changing clock uncertainty by 1 picosecond, are trivially implemented and transparent to the design flow. With the increasing availability of multiple and parallel computing platforms (multiple workstations in server farms, multiprocessor workstations, multicore processors, and multithreaded cores), our multi-run approach can be deployed with negligible impact on the overall design cycle time. We also provide a method to find the input parameters whose perturbation most strongly affects design quality, and the number of runs needed to obtain reasonably good and predictable solutions.
The remainder of this paper is organized as follows. In Section II, we examine our four motivational questions regarding the unpredictability and 'chaotic' nature of commercial tools.
We describe our proposed method for exploiting chaotic tool behavior in Section III. Finally, Section IV gives conclusions.
II. CHAOS IN IMPLEMENTATION TOOLS DUE TO INTENTIONAL INPUT DISTURBANCE
There are two basic types of knobs that affect the quality of optimization.
• Tool-specific options: specifically, command options to turn on/off specific optimization heuristics. (We do not explore this, but use the same default tool options in all of our experiments.)
• Design-specific constraints: notably timing-related constraints (clock cycle time, clock uncertainty, input/output minimum/maximum delay, etc.) and floorplan-related constraints (utilization, aspect ratio, primary pin locations, etc.).
In our study, we only examine the impact of intentional perturbation of design-specific constraints. For synthesis, we explore the impact of timing-related constraints, and for placement and routing (P&R), we explore the impact of both timing- and floorplan-related constraints. We vary design-specific parameters by small amounts, and measure design quality metrics such as worst negative slack (WNS), total negative slack (TNS), and total standard-cell area.
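As a concrete reading of the two timing metrics named above, the following minimal Python sketch computes WNS and TNS from a list of endpoint slacks. It assumes the common convention that WNS is the smallest endpoint slack and TNS is the sum of all negative endpoint slacks; the function names and the slack values are illustrative and are not taken from our testcases.

```python
from typing import Sequence


def worst_negative_slack(endpoint_slacks_ps: Sequence[float]) -> float:
    """Smallest (most negative) endpoint slack, in picoseconds.

    Assumed convention: the minimum is returned even if it is positive.
    """
    return min(endpoint_slacks_ps)


def total_negative_slack(endpoint_slacks_ps: Sequence[float]) -> float:
    """Sum of the slacks of all violating endpoints (slack < 0), in picoseconds."""
    return sum(s for s in endpoint_slacks_ps if s < 0)


# Illustrative endpoint slacks (ps); hypothetical values, not measured data.
slacks = [35.0, -12.0, -96.0, 4.0]
print(worst_negative_slack(slacks))  # -96.0
print(total_negative_slack(slacks))  # -108.0
```

Under this convention a design that meets timing has a non-negative WNS and a TNS of zero.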
We implement four testcases using a TSMC 65nm GPLUS library: two open-source cores, AES and JPEG, obtained as RTL from opencores.org [6], and two subblocks, LSU (load and store unit) and EXU (execution unit), of the OpenSPARC T1 design, obtained from the Sun OpenSPARC Projects site [7]. We use a traditional timing-driven synthesis, placement and routing flow, and analyze final timing quality using a signoff RC extraction (Synopsys Star-RCXT [13]) and a signoff static timing analyzer (Synopsys PrimeTime [10]). All specific tool versions are given in the references section below. Table I shows one of the implementation results for each of our testcases at the signoff stage. The remainder of this section describes our experimental investigation of the four motivating questions above, on the correlation between design stages and on the chaotic behavior of implementation tools.
A. Correlation of Quality between Design Stages
Motivating Question 1: How strongly correlated are post-synthesis netlist quality and post-routing design quality?
It is by no means certain that a better-quality synthesized netlist will eventually lead to a better-quality design after placement and routing (P&R). Our first experiments examine the impact of the quality of input netlists on final P&R outcomes. We synthesize the AES core with various clock cycle times to obtain netlists with different timing quality with respect to a single target clock cycle time. For each synthesized netlist, we perform placement and routing at the given target clock cycle time. We use Cadence RTL Compiler for synthesis and Cadence SOC Encounter for P&R. Table II summarizes the worst negative slack values after synthesis and after P&R, respectively. The first column shows the clock cycle time applied at the synthesis stage, and the second column shows the worst negative slack (WNS) of the synthesized netlist evaluated at the target clock cycle time, i.e., 2ns. The third and fourth columns show the WNS values after placement and routing, obtained from Cadence SOC Encounter (SOCE) [9] and Synopsys PrimeTime (PT) [10], respectively. From the data, we observe that an input netlist with better timing slack does not always result in better timing slack after placement and routing. Furthermore, due to the timing miscorrelation between P&R and signoff tools, the worst-slack netlist from the synthesis stage, obtained using the largest clock cycle time (i.e., 2.4ns), actually results in the best timing at signoff. How this can occur is suggested by Figure 1, which gives a rank-correlation plot of endpoint slack values of timing paths in the AES netlist, between post-synthesis and post-placement stages. The correlation coefficient is only 0.421. Due to this miscorrelation between synthesis and P&R stages, the eventual benefit from maximizing the quality of the post-synthesis netlist is unclear.
This gives us some intuition that any incremental tool runs should be directed to the P&R stage rather than the synthesis stage. 
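To make the notion of rank correlation between two design stages concrete, the sketch below computes a Spearman rank-correlation coefficient for per-endpoint slacks. That the coefficient quoted for Figure 1 is of the Spearman type is an assumption, and the slack values are hypothetical; only the Python standard library is used.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical slacks (ps) of the same five endpoints at two stages.
post_synthesis = [-120.0, -80.0, -40.0, 0.0, 30.0]
post_placement = [-60.0, -150.0, -20.0, -90.0, 10.0]
print(spearman(post_synthesis, post_placement))
```

A coefficient near 1 would mean that the ordering of endpoints by criticality is preserved from one stage to the next; a value well below 1 indicates that the critical endpoints change between stages.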
Motivating Question 2: How strongly correlated are design quality as evaluated by vendor place-and-route tools and design quality as evaluated by signoff tools?
Aside from the suboptimal nature of underlying optimization algorithms, there is another source of noise in the traditional implementation flow. Timing optimization is always based on (incremental) timing analysis. As is well known, such timing analysis requires models for gates and interconnect: timing and power models for gates are precharacterized in lookup tables using SPICE, and interconnect RC models are extracted from layout with precharacterized capacitance tables. Using the gate and interconnect models, effective load capacitance, slew degradation, and interconnect delay are calculated using, e.g., asymptotic waveform estimation. The effective load capacitance and slew values thus obtained are then used as table indices to find corresponding gate delay values from delay tables. However, optimization tools use simplified models and embedded calculators to evaluate timing with the least possible computational expense. As a result, timing results seen by optimization tools can differ from those seen by signoff timing analysis tools. Figure 2 shows the timing correlation between a signoff timing analyzer, Synopsys PrimeTime, equipped with a signoff RC extractor, Synopsys Star-RCXT, and two place-and-route tools, Synopsys Astro (ASTRO) [11] and Cadence SOC Encounter (SOCE) [9], for 29 different implementations of the AES core. We observe that more than 200ps of timing slack difference can occur.
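The table-lookup step described above can be illustrated with a small Python sketch. It assumes a two-dimensional delay table indexed by input slew and effective load capacitance, with bilinear interpolation between the four surrounding entries; the interpolation scheme, the function names, and the table values are illustrative assumptions and do not describe any particular tool or library.

```python
from bisect import bisect_right
from typing import Sequence


def _bracket(axis: Sequence[float], x: float) -> int:
    """Index i of the table segment [axis[i], axis[i+1]] used for x.

    The index is clamped, so points outside the table are extrapolated
    linearly from the nearest segment.
    """
    i = bisect_right(axis, x) - 1
    return max(0, min(i, len(axis) - 2))


def lookup_delay(slews: Sequence[float], loads: Sequence[float],
                 table: Sequence[Sequence[float]],
                 slew: float, load: float) -> float:
    """Bilinear interpolation of table[slew_index][load_index]."""
    i = _bracket(slews, slew)
    j = _bracket(loads, load)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tc = (load - loads[j]) / (loads[j + 1] - loads[j])
    return (table[i][j] * (1 - ts) * (1 - tc)
            + table[i][j + 1] * (1 - ts) * tc
            + table[i + 1][j] * ts * (1 - tc)
            + table[i + 1][j + 1] * ts * tc)


# Hypothetical 2x2 delay table: slew axis in ps, load axis in fF, delay in ps.
slew_axis = [10.0, 50.0]
load_axis = [1.0, 5.0]
delay_table = [[20.0, 40.0],
               [30.0, 60.0]]
print(lookup_delay(slew_axis, load_axis, delay_table, 30.0, 3.0))  # 37.5
```

Because the result depends directly on the slew and load values used as indices, two tools that compute slightly different effective capacitances for the same net will read different delays from the same table.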
To understand methodology implications of P&R vs. signoff discrepancies, we make a brief excursion into root causes of such discrepancies. Typical root causes are as follows.
• RC extraction. We extract RC values using a signoff extractor, Synopsys Star-RCXT (STAR), and a place-and-route tool, Cadence SOC Encounter (SOCE), and compare the extracted capacitance values using the Cadence Ostrich program. Figure 3 compares the extracted capacitances from STAR with those from SOCE. We observe that SOCE underestimates capacitance by 18.6%. This significant difference may explain why SOCE so consistently sees optimistic timing slacks when compared to the signoff tool (Figure 2).
Fig. 3. Normalized capacitance correlation between a signoff RC extractor Synopsys Star-RCXT (STAR) on the x-axis, and a place-and-route tool Cadence SOC Encounter (SOCE) on the y-axis.
• Delay calculation. We compare timing between SOCE and PT with the same RC parasitic file from SOCE, to eliminate the impact of the discrepancy in RC extraction. Table IV shows the WNS and TNS calculated from both tools for our four testcases with default input parameters. The data suggests that delay calculation in SOCE and PT is well-correlated, as is usually the case. (N.B.: Viable implementation and signoff tools will calculate delay on a given extracted path to within at most a couple of tens of picoseconds difference from the 'golden' tool.)
• Other: Path-tracing and signal integrity in timing analysis.
(1) Endpoint slack differences greater than several hundred picoseconds are typically due to discrepancies in path-tracing (i.e., the interpretation of timing constraints, timing exceptions, generated clocks, cycle-breaking in the timing graph, etc.) in the static timer. However, the testcases we use have only simple timing constraints consisting of one clock definition along with input/output delay constraints. Thus, path-tracing is not a factor in the timing miscorrelation between P&R and signoff in our experiments.
(2) Endpoint slack differences on the order of 100ps are often due to discrepancies in signal integrity (e.g., crosstalk-induced delay variation) analyses. However, all timing analyses in our experiments are performed with signal integrity options turned off. Thus, signal integrity is not a factor in observed timing miscorrelations, either.
Although there is a discrepancy between place-and-route tools and signoff tools, we believe that this discrepancy (in all experimental data we report here) is systematic and attributable to internal RC extraction and path delay calculation. As shown in Figure 2, SOCE consistently reports more optimistic timing slack than the signoff tool. This correlated discrepancy can be predicted and compensated. We may infer that design quality at the P&R stage is a stronger lever on final signoff timing than design quality at the synthesis stage.
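One simple way to predict and compensate a systematic P&R-versus-signoff discrepancy is to fit a straight line through paired slack observations and use it to translate P&R slack into expected signoff slack. The sketch below is only an illustration of that idea using an ordinary least-squares fit; the linear model, the function names, and the data points are assumptions and are not taken from Figure 2.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept


def predict_signoff_slack(pnr_slack_ps: float, slope: float, intercept: float) -> float:
    """Expected signoff slack for a slack value reported by the P&R tool."""
    return slope * pnr_slack_ps + intercept


# Hypothetical paired WNS observations (ps) from earlier implementations:
# slack reported by the P&R tool and slack reported at signoff.
pnr_wns = [-50.0, 0.0, 40.0, 100.0]
signoff_wns = [-150.0, -100.0, -60.0, 0.0]

slope, intercept = fit_line(pnr_wns, signoff_wns)
print(slope, intercept)
print(predict_signoff_slack(20.0, slope, intercept))
```

With such a fit, the P&R target can be tightened by the predicted offset so that a design that closes timing in the P&R tool is also expected to close at signoff.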
B. Chaotic Behavior in Optimization Tools
In this subsection, we assess the impact of intentional perturbations applied to input parameters of synthesis and place-and-route tools. (As noted above, the perturbations that we study differ from the netlist manipulations studied in [2], and are transparently applied without changing the design flow.)
Motivating Question 3: What chaotic behavior is associated with input parameters of vendor synthesis tools?
We analyze the impact of perturbation of timing-related parameters, such as clock cycle time, input/output delay, and clock uncertainty, at the synthesis stage. We use two commercial gate-level synthesis tools, Synopsys Design Compiler (DC) [12] and Cadence RTL Compiler (RC) [8] . We vary each parameter by an amount ranging from -3ps to 3ps with 1ps increments, and measure the changes in netlist quality metrics. Table II -B summarizes WNS and total standard-cell area of the resulting synthesized netlists.
Ideally, small perturbations of, e.g., a few picoseconds in input parameters should not change output quality, or should have predictable consequences. For example, a reduction of clock cycle time by 1 picosecond can be reasonably expected to result in a reduction of timing slack by the same 1 picosecond, since the difficulty of design optimization is virtually unchanged. However, due to the unpredictability of optimization tools, the resulting design quality appears to be random. We observe up to 53ps and 34ps of WNS variations in DC and in RC, respectively. Among the results, we observe that some input perturbations result in better timing quality than the original (without perturbations) design optimization. This improvement can be regarded as a benefit from 'chaotic behavior' of the design optimization tools. However, as discussed in Section II-A, input parameter perturbations at synthesis may not have a great effect on final signoff design quality, due to the unpredictable miscorrelation between synthesis and place-and-route tools.
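The perturbation sweep described above (from -3ps to 3ps in 1ps increments) can be generated mechanically. The sketch below is a minimal illustration: it enumerates the seven perturbed values of one parameter and formats a clock period as an SDC create_clock command. The nominal period of 2000ps, the port name clk, and the assumption that the constraint file uses nanoseconds are hypothetical choices made only for this example.

```python
from typing import List


def perturbed_values_ps(nominal_ps: int, max_delta_ps: int = 3, step_ps: int = 1) -> List[int]:
    """Nominal value plus every perturbation from -max_delta_ps to +max_delta_ps."""
    return [nominal_ps + d for d in range(-max_delta_ps, max_delta_ps + 1, step_ps)]


def clock_period_sdc(period_ps: int, port: str = "clk") -> str:
    """Format a clock period as an SDC command (time unit assumed to be ns)."""
    return f"create_clock -period {period_ps / 1000:.3f} [get_ports {port}]"


# Hypothetical nominal clock period of 2000 ps; integer picoseconds are used
# so that 1 ps steps are represented exactly.
for period in perturbed_values_ps(2000):
    print(clock_period_sdc(period))
```

Each generated constraint would drive one otherwise identical tool run, so that any difference in the resulting WNS or area is attributable to the 1ps-scale change alone.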
Motivating Question 4: What chaotic behavior is associated with input parameters of vendor place-and-route tools?
We also analyze the impact of perturbations of both timing-related parameters and floorplan-related parameters in place-and-route tools. Since our experiments in Section II-A suggested that the quality of an input netlist is not preserved during placement and routing, we take one synthesized netlist arbitrarily from the synthesis results in Table IV, and perform traditional timing-driven placement and routing. We use two place-and-route tools, Synopsys Astro (ASTRO) and Cadence SOC Encounter (SOCE). The nominal clock cycle time, input/output delay, and clock uncertainty values are shown in Table I, and nominal utilization and aspect ratio are 70% and 1.0, respectively. Table V summarizes worst negative slack (WNS) and total negative slack (TNS) calculated using a signoff RC extraction and static timing analysis. We do not include the final area of placed and routed designs due to space limitations in the table, but the variation of area is observed to be as high as 16.4% in the AES core when utilization is increased by just 1%. From the table, we again observe that small input parameter perturbations give rise to large timing slack changes, e.g., WNS varies by up to 190ps (from -96ps to -287ps) in EXU, and TNS varies by more than 69ns (from -152ns to -83ns) in JPEG.
C. Summary of observations
Our experimental study provides the following evidence for 'chaotic behavior'.
• Input parameter perturbation in synthesis results in up to 53ps of WNS variation, and sometimes produces better-quality netlists than the original design constraints. However, because there is little correlation between post-synthesis netlist quality and post-routing design quality, improved synthesis results will not necessarily improve results after placement and routing.
• There is a large discrepancy in timing slack between optimization tools and signoff tools, which in our studies arises mainly from (i) the difference between optimization-internal and signoff RC extractors, and (ii) small discrepancies in delay calculation between optimization and signoff tools. Timing slack with the same RC parasitics can vary up to 34ps solely due to delay calculation discrepancies. However, the discrepancy may be predictable and hence compensatable.
• Input parameter perturbation in place-and-route tools results in up to 190ps of WNS variation, more than 69ns of TNS variation, and up to 16.4% variation in total standard-cell area. In contrast to chaos in synthesis outcomes, the chaos in place-and-route outcomes appears more exploitable to improve final signoff quality.
• This chaotic behavior is unpredictable. In particular, for different design types or domains, it may not be possible to assess differences in the nature of the chaotic behavior. However, from Table V, we can observe that more timing-critical designs (e.g., designs with more negative timing slack, more violating paths, or smaller clock cycle time) tend to show more sensitivity to input parameter perturbation.
III. PREDICTABILITY FROM CHAOS
In the previous section, we observed 'chaos': large variation in final implementation quality arising from small (negligible) input parameter changes. Frequently, results after small input perturbations are better than those obtained using nominal input parameter values. From the results, we expect that we can achieve better design quality (and, potentially, improved design cycle time) without additional human effort or flow modifications. The key idea: run multiple times with small input perturbations, and return the best-quality solution.
If only one CPU or tool license is available, then we can run multiple times on one CPU ('multi-run'), trading runtime for design quality. But when there are idle CPUs and licenses in a large computing farm, we may be able to use as many CPUs as possible in parallel ('multi-start') without affecting the design cycle time. In either scenario, we obtain a "best-of-k" methodology: (i) run k times on a CPU while varying input parameter values, or (ii) start k runs with different input parameter values, and take the best results out of the k different runs.
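The two scenarios differ only in how many runs execute at once, which the following Python sketch makes explicit. The function run_flow is a hypothetical stand-in for one complete implementation run: here it returns a made-up, deterministic WNS value so that the example is self-contained, whereas a real version would launch the tool flow with the perturbed constraint and parse the signoff report. With one worker the sketch behaves as 'multi-run'; with several workers it behaves as 'multi-start'.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional, Sequence, Tuple


def run_flow(clock_period_ps: int) -> float:
    """Hypothetical stand-in for one tool run; returns a signoff WNS in ps.

    The formula is a toy that only mimics an irregular response to small
    input changes. It is not a model of any real tool.
    """
    return -150.0 + ((clock_period_ps * 37) % 101)


def best_of_k(periods_ps: Sequence[int],
              max_workers: Optional[int] = None) -> Tuple[int, float]:
    """Run the flow once per perturbed period and keep the best (largest) WNS."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_flow, periods_ps))
    best = max(range(len(results)), key=results.__getitem__)
    return periods_ps[best], results[best]


# Seven perturbed clock periods around a hypothetical nominal of 2000 ps.
periods = [1997, 1998, 1999, 2000, 2001, 2002, 2003]
print(best_of_k(periods, max_workers=1))  # 'multi-run': one run at a time
print(best_of_k(periods, max_workers=7))  # 'multi-start': all runs in parallel
```

Both calls return the same answer; only the wall-clock time differs, which is the trade-off between the single-license and the compute-farm scenarios.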
If we wish to experimentally determine the best number k of runs in a statistically meaningful manner, it at first seems necessary to execute many trials for each possible value of k. For example, we could conduct N trials each with a set of k runs, then record the best solution out of each set of k runs, and then find the average ('expected') best solution for the given value of k. If we know the average best-of-k solution value for each value of k, then we can determine which k gives reasonably good solutions compared to the cost of resources. The challenge is that the above-described procedure requires far too many runs. Naively, if we run N trials of "best-of-k" runs, we may require N × k separate runs. And if we test six different values of k (e.g., 1, 2, 3, 4, 5, and 10) through 100 trials each, we would have to perform 100 × (1 + 2 + 3 + 4 + 5 + 10) = 2500 separate runs.
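The run count of the naive procedure can be checked with a few lines of Python; the function name is illustrative.

```python
from typing import Sequence


def naive_run_count(k_values: Sequence[int], trials: int) -> int:
    """Tool runs needed when every trial of every k value uses fresh runs."""
    return trials * sum(k_values)


print(naive_run_count([1, 2, 3, 4, 5, 10], trials=100))  # 2500
```

The count grows with the sum of all tested k values, so adding larger values of k to the study quickly becomes impractical when each run is a full implementation flow.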
To reduce the number of test runs needed to determine the best k value, we use the following sampling approach, which was originally presented in [2].
1) Run a smaller number of different runs, e.g., 50 times with different inputs, instead of 2500 runs as in the previous example.
2) Record a quality metric, e.g., WNS, for each run. Then, assume that the set of solutions for the 50 runs is the 'virtual' solution space.
3) Randomly sample k solutions out of the 'virtual' solution space N=100 times, and record the best result for each choice of k solutions. This process replaces the actual N trials of k runs each.
4) Find minimum, maximum, and average values of the best results recorded from the N sampling trials.
For our experiments, we apply this "best-of-k" method to the implementation results summarized in Table V. First, we find which input parameter is the most useful to perturb, with respect to k = 1, 2, 3, ..., 10. We randomly choose k solutions in each of 100 trials, out of our seven different runs for each perturbed input parameter: clock cycle time (T), clock uncertainty (S), input/output delay (B), aspect ratio (A), and utilization (U). We then find the worst, best, and average of the best WNS from 100 trials for each k value. Table VI shows the "quality" ranks of each input parameter for our testcases implemented using ASTRO. From the table, we observe that clock cycle time (T) or input/output delay (B) perturbations may be the best for the AES design, utilization (U) perturbation may be the best for the JPEG design, and input/output delay (B) or aspect ratio (A) perturbations may be the best for the EXU design. For LSU, when k is small, clock cycle time (T) or utilization (U) perturbations give the best design quality, but when k is larger than six, aspect ratio (A) perturbations give the best design quality.
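The four-step sampling procedure above can be sketched in Python as follows. The sketch assumes that sampling is done with replacement, which is consistent with evaluating k up to 10 on only seven runs per parameter but is not stated explicitly above; it also assumes that a larger (less negative) WNS is better. The function name, the fixed random seed, and the WNS values are illustrative and are not data from Table V.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def best_of_k_stats(virtual_space: Sequence[float], k: int,
                    trials: int = 100, seed: int = 0) -> Tuple[float, float, float]:
    """Return (worst, best, average) of the best-of-k WNS over `trials` samples.

    Each trial draws k solutions from the virtual solution space
    (with replacement, by assumption) and records the best one.
    """
    rng = random.Random(seed)
    bests = [max(rng.choices(virtual_space, k=k)) for _ in range(trials)]
    return min(bests), max(bests), mean(bests)


# Hypothetical WNS values (ps) from seven perturbed runs of one parameter.
wns_runs = [-287.0, -210.0, -180.0, -150.0, -131.0, -120.0, -96.0]

for k in (1, 2, 3, 5, 10):
    worst, best, average = best_of_k_stats(wns_runs, k)
    print(k, worst, best, round(average, 1))
```

Because the samples are drawn from already completed runs, evaluating additional values of k or additional trials costs no further tool runs.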
Second, we find the best k value if both the input parameters and the parameter values are randomly chosen, since the best knob is not common for different testcases. For each testcase, we randomly choose k solutions in each of 100 trials, out of the 35 different solutions available (five different input parameters times seven different (perturbed) values of each parameter) for the testcase. Figures 4, 6, 8, and 10 (one figure per testcase) show the worst, best and average WNS values out of 100 trials of best-of-k sampling when ASTRO is used for implementation. Figures 5, 7, 9, and 11 show the worst, best and average WNS values out of 100 trials of best-of-k sampling when SOCE is used for implementation.
From Figures 4-11, we observe that the average WNS from 100 trials improves rapidly with increasing k. In most cases, when k = 3, the average expected quality is within 20ps of the best solution quality. We also observe that the worst-case WNS from 100 trials can be improved significantly with small k. For example, when k = 3, multi-run or multi-start using SOCE is expected to improve WNS of EXU by more than 100ps.
IV. CONCLUSIONS
[...] and signoff. We also characterize the effects of intentional, negligible perturbations of input parameters on output quality of commercial tools and flows. Based on our experimental results, we propose a methodology to exploit the chaotic tool behavior using 'multi-run' (1 license or 1 CPU scenario) or 'multi-start' (multiple licenses and CPUs scenario) with intentional perturbations of the input parameters. We also describe an efficient method to determine the best number k of multiple runs that will yield predictably high-quality solutions without any additional manual analysis or manipulation, without changing any design flows, and without wasting valuable computing resources.
The deployment of new implementation and signoff tool capabilities opens up new directions for ongoing and future work, including the following. (1) We seek to analyze the potential advantages of the inherent "chaos" in advanced physical synthesis tools that exploit physical information at the synthesis stage to reduce synthesis-placement miscorrelations. (2) We seek to evaluate the benefits of chaos in conjunction with more advanced signoff methodologies (e.g., signal integrity-enabled STA), as well as more advanced signoff analyses (e.g., path-based analysis), which may exhibit even more chaotic behavior than today's standard flows.
