This paper exploits useful skew to improve system performance and robustness. We formulate a robust integer linear programming problem that considers the interactions between data and clock paths on a microprocessor chip to improve clock frequency. The timing slack is optimized for each path to determine a clock schedule. The percentage of timing violations, obtained from a 1000-point Monte Carlo simulation, is highlighted as a yield prediction and conveys the robustness of the clock schedule. The results show performance improvement of up to 9.747% with 20% yield and up to 6.682% with 100% yield. The novelty of the proposed method is its ability to trade off performance improvement in frequency against robustness via a single variable in the formulation.
INTRODUCTION
The ITRS future frequency trend for high performance microprocessors is predicted to rise to 20 GHz. One of the direct impacts of aggressive scaling is the increase in clock skew as a percentage of clock cycle time, which reduces the clock budget for useful computation. Early approaches focused on designing clock distribution topologies to keep clock skew at a minimum. These efforts achieved results close to zero skew but with huge power and area sacrifices. Also, with zero skew, the maximum achievable operating frequency is limited by the maximum datapath delay in the circuit. In the quest to find alternatives, there has been significant effort [4], [12], [6], [3], [10], [2], [5] to explore useful skew to improve performance. Although useful skew enables systems to operate at higher clock frequencies, more and more signal paths get pushed towards the edge of satisfying timing requirements. As the amount of uncertainty increases with scaling [15], [13], [9], the probability of failure for a design with useful skew increases. The random process and environmental variations that dominate the behavior of devices are hard to predict, let alone eliminate, making clock skew a source of uncertainty that is very difficult to manage. There is thus a need for a methodology that models the uncertainties to improve the robustness of the design. Robustness, in the context of useful skew, is the percentage of chips that can meet timing requirements for the optimized clock schedule, which is intended to improve performance.
* Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICCAD '06, November 5-9, San Jose, CA
This paper presents LP-based ILP formulations to optimize useful skew with considerations for uncertainties in clockpath and datapath delays. It attempts to predict the amount of risk involved in pushing the operating frequency using useful skew. Solutions to robust formulations are typically not optimum. Throughout this paper, optimal refers to the best trade-off among competing constraints, i.e., performance and robustness. Our approach has the following salient features: 1) it considers variations in combinational block and clock delays; 2) it incorporates considerations for the register locations and physical clocking domains that have the same amount of clock skew adjustment; and 3) while previous approaches [10], [5], [4] have considered robustness, our approach has the ability to control the trade-off between performance and robustness of the clock schedule. When applied to a 64-bit microprocessor, the improvement in clock frequency ranges from 9.747% with 20% yield (no considerations for uncertainties) to 6.682% with a yield of 100% (data and clock uncertainties). The yield (i.e., the robustness) data was obtained using a 1000-point Monte Carlo simulation.
The remainder of the paper is organized as follows. Section 2 discusses existing research and Section 3 presents the background of our work. Clock distribution details of the microprocessor used in our experiment are given in Section 4. Section 5 presents the formulations used to optimize the clock frequency. Section 6 and Section 7 discuss the experimental results and conclusions, respectively.
EXISTING RESEARCH
Retiming [8] was one of the earlier solutions to the problem of clock skew optimization. However, retiming cannot be applied to certain areas and could increase the number of flops. This motivated Fishburn [4] to explore LP approaches to optimize different circuit parameters, performance and robustness, by introducing useful skew. The inability of Fishburn's approach to handle level triggered flops was highlighted, and a formulation for level triggered memory elements was presented, by Sakallah et al. [12]. Chao and Sha used the concept of scheduling to form a common background and proposed a combined retiming and clock skew introduction approach in [2] to study the interplay between retiming and clock skew introduction. The possibility that simultaneous, as opposed to sequential, clock skew scheduling and retiming could provide better results was discussed in [6]. A two-phase graph approach was proposed in [3] to improve Fishburn's approach in terms of maintaining an upper bound on the skew. The lack of considerations for parallel and feedback paths was addressed in [10] by suggesting an approach to optimally assign each path a skew value after determining a common permissible range for all the paths between any two nodes. This approach provided a more robust design compared to that in [3]. However, all variations were collected into a single parameter, making it impractical to handle large complex chips. A quadratic programming problem was proposed in [5] to determine a clock skew schedule with improved tolerance to process variations by minimizing the least square distance between the desired and actual values of the clock skew schedule over the entire circuit. A different approach was presented in [11]. The objective was to obtain a clock schedule that achieves a shorter clock period and can be realized by a light clock tree. The algorithm takes into account the register locations while optimizing the clock schedule so that the wire lengths for clock distribution are within acceptable limits.
Copyright 2006 ACM 1-59593-389-1/06/0001 ...$5.00.
In light of the advances in technology and increased variations, the existing approaches have the following shortcomings: 1) most of the approaches fail to consider the impact of variations in the combinational delays; 2) most of the approaches do not consider clock domains while trying to optimize the clock period to minimize wire lengths and power consumption; and 3) there is no mechanism in the existing algorithms to trade performance against robustness.
BACKGROUND
The use of the slack that exists between the required arrival time for the data ($t_{reqd}$) and the actual arrival time of the data ($t_{delay}$) has the following advantages: 1) the slack information helps eliminate the difference between level triggered latches and edge triggered flops, providing an effective model for both kinds of memory elements; and 2) the slack from typical static timing analysis conveniently takes into account the clock delays at the latch points, relative to the data delay. There are two different types of paths in a typical logic circuit: state paths and phase paths. If adjacent memory elements are fed with clock signals having opposite phases, the path between them is characterized by half a clock cycle and is called a phase path. A state path exists between a pair of memory elements fed with the same phase clock and triggering at the same clock event. An increase in the slack for setup implies a decrease in the slack for hold. Thus, for an independent path, the maximum amount of improvement can be determined by formulating a problem that maximizes $M$ subject to the constraints expressed below
where $slack_{max}$ and $slack_{min}$ are the minimum slacks for the maxtime and mintime paths between registers $i$ and $j$, $M$ is the possible reduction in clock period, and $x_i$ and $x_j$ are the clock arrival times at the respective latches.
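As a worked sketch of the single-path case, assume that register $i$ launches the data, register $j$ captures it, and a later capture clock adds setup slack while removing hold slack. These sign conventions are assumptions made for illustration and are not taken from the formulation above. Under them, the constraints and the resulting bound on $M$ could be written as:

```latex
\begin{align*}
\text{maximize }   & M \\
\text{subject to } & x_i - x_j + M \le \mathit{slack}_{max} && \text{(setup, assumed form)} \\
                   & x_j - x_i \le \mathit{slack}_{min}     && \text{(hold, assumed form)}
\end{align*}
% Adding the two constraints eliminates the arrival times:
\[ M \le \mathit{slack}_{max} + \mathit{slack}_{min} \]
```

With illustrative values $slack_{max} = 30$ ps and $slack_{min} = 50$ ps, the bound is $M \le 80$ ps, reached when $x_j - x_i = 50$ ps. This makes the setup/hold coupling explicit: every picosecond of skew given to setup is taken from hold.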
CLOCK DISTRIBUTION DETAILS
In modern 64-bit microprocessors the clock distribution network is often highly hierarchical and complex in order to achieve the maximum operating frequency possible. The use of useful skew fits well with this requirement for high performance. We apply our proposed useful skew optimization algorithm to a recent 64-bit microprocessor [7] to examine its effectiveness. Figure 1 shows that the PLL is the center of the clock distribution topology. It is responsible for generating the actual operating clock signal. Each CPU core is divided into three frequency regions, with a digital frequency divider (DFD) in each region, for independent control. The voltage sensors convey changes in the regional voltage to the regional voltage detector (RVD) and help maintain the clock to data ratio, thus avoiding possible timing violations. The second level clock buffers (SLCBs) comprise self-bias amplifiers that reinforce the clock signal from the digital frequency dividers. The regional active deskew (RAD) internally comprises a set of clock vernier devices (CVDs) and a phase comparator that are responsible for region based active deskewing to reduce the skew resulting from process, voltage and temperature variations. The SLCBs feed the clock signal to the local buffers through a different set of CVDs and then down to the gaters. The CVDs are programmable devices with a 3-bit (000-111) setting that provides 8 quantized delay levels. These CVDs can be programmed via scan and firmware to debug post-silicon defects and to remove any skew that may be present. CVDs can also be used to introduce useful skew. In theory, the CVDs present within the RAD can be programmed to increase the range available for those CVDs that follow the SLCBs. Our experiments were carried out with adjustments enabled at the SLCB level only and at both RAD and SLCB levels. Table 1 summarizes the clock distribution system on the 64-bit microprocessor and gives an idea of the size of the test circuit.
Each CVD at the SLCB level controls a set of latches that will take the same skew adjustment from the controlling CVD and these latches are classified into a single domain.
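The quantized, domain-based adjustment described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the uniform step size `STEP_PS`, the linear code-to-delay mapping, and all function and latch names are hypothetical.

```python
from collections import defaultdict

CVD_BITS = 3                    # 3-bit setting, codes 000..111
CVD_LEVELS = 2 ** CVD_BITS      # 8 quantized delay levels
STEP_PS = 5.0                   # hypothetical delay per code step, in ps


def cvd_delay(code, step_ps=STEP_PS):
    """Delay contributed by a CVD programmed with the given code."""
    if not 0 <= code < CVD_LEVELS:
        raise ValueError("CVD code must be in 0..7")
    return code * step_ps


def nearest_code(target_ps, step_ps=STEP_PS):
    """Closest programmable code for a desired delay, clamped to range."""
    code = round(target_ps / step_ps)
    return max(0, min(CVD_LEVELS - 1, code))


def group_by_domain(latch_to_cvd):
    """Collect latches that share a controlling CVD into one domain."""
    domains = defaultdict(list)
    for latch, cvd in latch_to_cvd.items():
        domains[cvd].append(latch)
    return dict(domains)
```

Because every latch in a domain receives the same adjustment, an optimizer only needs one arrival-time variable per CVD rather than one per latch, which is what keeps the problem size manageable.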
ROBUST FORMULATIONS
A simple ILP formulation to maximize clock frequency is given in Equation 2:
Maximize M subject to:
where $l$ is the number of flip flops and $n$ is the number of quantization levels within a clock tuning element. $w_{lm}$ is the $m$th quantization variable associated with the $l$th memory element and $Q_m$ is the $m$th quantization level. Constraints B1, B2 and I1 together make sure that $x_i$ and $x_j$ can take only one of the $n$ quantized values. The S and H constraints are the setup and hold constraints expressed in terms of slack. $slack_{max}$ and $slack_{min}$ are the minimum slacks for the maxtime and mintime paths between registers $i$ and $j$. The setup constraint tries to maximize the negative skew that can be introduced, while the hold constraint places an upper bound on the negative skew that can be introduced, to avoid mintime violations. $coeff$ indicates how many clock cycles were allotted for a timing arc. It can take values such as 0.5, 1, or 1.5, which imply half a clock cycle, one clock cycle, or one and a half clock cycles, respectively. $M$ indicates the amount of reduction achievable in clock cycle time. The solution to the ILP in Equation 2 pushes all the timing paths to the edge of satisfying their timing constraints. The nominal values for clock arrival times and data are not always precisely known, and small variations in the input data can completely invalidate this solution.
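To illustrate how quantized arrival times interact with the setup and hold constraints, the sketch below exhaustively searches a toy instance. It assumes a setup constraint of the form `x_i - x_j + coeff * M <= slack_max` and a hold constraint of the form `x_j - x_i <= slack_min`, with domain `i` launching and domain `j` capturing. These forms, the function name, and the example numbers are assumptions for illustration, and exhaustive search stands in for a real ILP solver.

```python
from itertools import product


def best_schedule(n_domains, levels, arcs):
    """Exhaustively search quantized clock arrival times.

    arcs: list of (i, j, slack_max, slack_min, coeff) tuples, where
    domain i launches the data and domain j captures it.
    Returns (M, schedule) with the largest period reduction M found,
    or None if no assignment satisfies every hold constraint.
    """
    best = None
    for x in product(levels, repeat=n_domains):
        # Hold (mintime) check, assumed form: x_j - x_i <= slack_min.
        if any(x[j] - x[i] > s_min for i, j, _, s_min, _ in arcs):
            continue
        # Setup (maxtime), assumed form: x_i - x_j + coeff*M <= slack_max,
        # so each arc caps M at (slack_max - x_i + x_j) / coeff.
        m = min((s_max - x[i] + x[j]) / c for i, j, s_max, _, c in arcs)
        if best is None or m > best[0]:
            best = (m, x)
    return best


# Two domains, eight hypothetical 5 ps levels, one arc in each direction.
levels = [5 * k for k in range(8)]
arcs = [(0, 1, 10, 20, 1.0), (1, 0, 40, 20, 1.0)]
print(best_schedule(2, levels, arcs))
```

In this toy instance zero skew allows a reduction of only 10, while delaying domain 1 by 15 balances the two setup slacks and raises the reduction to 25. The balanced solution leaves both setup arcs with no margin, which is the fragility that the robust formulations address.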
There have been two approaches to address data uncertainty over the years: (a) stochastic programming, and (b) robust optimization. The ease with which robust optimization can be applied to real-world problems has made it a popular approach to address data uncertainty. Soyster [14] proposed a robust approach which handled data uncertainty in a highly conservative manner. The feasible region in this formulation was specified via set containment instead of the traditional set of convex inequalities. Soyster's formulation can be expressed as in Equation 3:

$$\begin{aligned} \text{maximize: } & c'x \\ \text{subject to: } & \sum_j a_{ij} x_j + \sum_{j \in J_i} \hat{a}_{ij} y_j \le b_i \quad \forall i \\ & -y_j \le x_j \le y_j \quad \forall j \\ & y \ge 0 \end{aligned} \tag{3}$$

where data uncertainty is captured using the bounded random variable $\hat{a}_{ij}$ for the coefficient matrix. The coefficient takes values in $[a_{ij} - \hat{a}_{ij}, a_{ij} + \hat{a}_{ij}]$, where $a_{ij}$ is the nominal value of the coefficient. $J_i$ is the set of coefficients $a_{ij}$, $j \in J_i$, that are subject to uncertainty. It is shown in [14] that for every possible value of the coefficient, the solution remains feasible. The purpose of the term $\sum_{j \in J_i} \hat{a}_{ij} |x_j|$ is to create a "gap" between the optimal solution of the nominal problem (i.e., $\sum_j a_{ij} x_j^*$) and $b_i$ to provide robustness. The clock arrival time, $x_i$, at a given latch $i$ is subject to uncertainty. Thus, the clock arrival time is bounded by $[x_i - \delta, x_i + \delta]$. The robust ILP formulation for clock skew optimization based on Soyster's approach is given in Equation 4.
Maximize M subject to:
where the S and H constraints differ from those in Equation 2 in terms of the protection they provide. $y_l$ is the additional variable for each $x_l$ defining new bounds. The constraint B3 ensures that at optimality $y_l = |x_l^*|$, and thus each $x_l$ is represented as $x_l^*(1 + \delta)$, where $\delta$ is the intended percentage protection for the clock arrival times. $\Delta$ represents the intended percentage protection for the delay through the combinational block. The value ($\Delta \cdot delay$) is subtracted from the slack to represent delay variations. For a particular timing arc, the uncertainties in the combinational block along the path alter the actual arrival time of the data and thus the available slack. The uncertainties in the combinational block delay can thus be modeled as uncertainties in the slack. The motivation to use Soyster's formulation was to provide maximum protection against uncertainty. However, the formulation is extremely conservative. It is highly desirable to provide a mechanism that allows a trade-off between robustness and performance. A robust LP formulation proposed by Bertsimas and Sim [1] can be applied to handle parameter uncertainty and provide a lever to control the trade-off between performance and robustness. Bertsimas' formulation is shown in Equation 5.
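$$\begin{aligned} \text{maximize } & c'x \\ \text{subject to: } & \sum_j a_{ij} x_j + \max_{\{S_i \cup \{t\} \,\mid\, S_i \subseteq J_i,\ |S_i| = \lfloor \Gamma_i \rfloor\}} \Big\{ \sum_{j \in S_i} \hat{a}_{ij} y_j + (\Gamma_i - \lfloor \Gamma_i \rfloor)\, \hat{a}_{it} y_t \Big\} \le b_i \quad \forall i \\ & -y_j \le x_j \le y_j \quad \forall j \\ & y \ge 0 \end{aligned} \tag{5}$$

where $\Gamma_i$ is the protection factor. $a_{ij}$, $\hat{a}_{ij}$, $y_j$, and $J_i$ are defined similarly to those in Soyster's formulation. Protection is provided by the term

$$\beta_i(x) = \max_{\{S_i \cup \{t\} \,\mid\, S_i \subseteq J_i,\ |S_i| = \lfloor \Gamma_i \rfloor\}} \Big\{ \sum_{j \in S_i} \hat{a}_{ij} |x_j| + (\Gamma_i - \lfloor \Gamma_i \rfloor)\, \hat{a}_{it} |x_t| \Big\}$$

Maximize M subject to: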
In Equation 9, $k$ is the number of timing arcs in the formulation. The constraints C1 and C2 place a restriction on what values $p_{ki}$ and $p_{kj}$ can assume based on the intended percentage protection ($\delta$) for the clock arrival times. This influences the value of $Z_k$ and thus, by varying $\Gamma_k$, a trade-off between robustness and performance can be achieved. The constraints B1, B2, and I1 help to enforce quantization. $\Delta$ is the intended percentage protection for the delay through the combinational logic path.
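The protection term of Bertsimas and Sim can be evaluated directly for a given candidate solution, which helps show how the protection factor acts as a lever. The sketch below is a generic illustration of that term for a single constraint, not the clock-skew formulation of Equation 9. The function name and inputs are hypothetical.

```python
import math


def protection(a_hat, x, gamma):
    """Bertsimas-Sim protection term beta_i(x) for one constraint.

    a_hat: maximum deviations of the coefficients (0 if certain)
    x: candidate solution
    gamma: protection factor, 0 <= gamma <= len(x)
    """
    # Deviations a_hat_j * |x_j|, largest first: the inner maximum takes
    # the floor(gamma) largest ones fully and the next one fractionally.
    devs = sorted((h * abs(v) for h, v in zip(a_hat, x)), reverse=True)
    whole = math.floor(gamma)
    beta = sum(devs[:whole])
    if whole < len(devs):
        beta += (gamma - whole) * devs[whole]
    return beta
```

With `gamma = 0` the term vanishes and the nominal constraint is recovered. With `gamma` equal to the number of uncertain coefficients it equals Soyster's full protection, the sum of all deviations. A setup or hold constraint involves two uncertain clock arrival times, which is consistent with the protection factor being swept between 0 and 2 in the experiments.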
EXPERIMENTAL RESULTS
All solutions were obtained using the CPLEX solver 8.1.0 running on a 3 GHz Pentium 4 microprocessor with 1 GB RAM. A combination of C/C++ and Perl scripts was written to construct the LP and ILP formulations. The simulations were classified under two different categories: Experiment 1 was performed without the integer constraints, thus relaxing the quantization, and Experiment 2 enforced strict quantization. In each experiment, two scenarios were considered. The first scenario did not enable any CVDs in the RAD. The second scenario enabled the CVDs in both RAD and SLCBs. The reason for considering these two scenarios was to investigate the effectiveness of incorporating clock tuning elements at different hierarchies in the clock tree. Each of these scenarios had provisions for two different cases. The first case was 5% clock uncertainty and 2% data uncertainty; the second case was 5% clock uncertainty and 0% data uncertainty. The choice of 2% (3-σ) variation for data arcs takes into account the statistical averaging of gate delays along a data timing arc for the 180 nm technology.
Experiment 1
This was used to determine the trend of variations in the percentage improvement in clock period and robustness for the different formulations. The solutions provide upper bounds for the clock frequency possible through useful skew but do not have quantized arrival times. Table 2 shows the percentage improvement in clock frequency without RAD adjustments and without adjustable delay quantization. It shows that the normal LP formulation, without any considerations for uncertainty, has the best improvement and Soyster's formulation shows the least improvement in clock frequency. The robust formulation shows varying levels of improvement for Γ between 0 and 2 and emphasizes the advantage of the proposed approach, i.e., the performance trade-off. Table 3 shows the level of robustness for each of the different formulations obtained from Monte Carlo simulations. For Monte Carlo simulations, data and clock delay uncertainties were assumed to have Gaussian distributions. The mean values of the Gaussian distributions for data were obtained from the static timing results. The 3-σ deviations for data and clock were chosen to be 2% and 5% of the mean values, respectively. The effects of delay uncertainties of both data and clock are mapped to timing slacks during Monte Carlo simulations. The column $SV_{max}$ reports the maximum margin by which a maxtime constraint was violated over 1000 Monte Carlo simulations. The column $SH_{max}$ reports the maximum margin by which a mintime constraint was violated over 1000 Monte Carlo simulations. The percentage violation is calculated as

$$\text{violation}(\%) = \frac{\text{no. of iterations with 1 or more mintime paths failing}}{\text{total no. of iterations}} \times 100$$

The robust formulation shows a decrease in the percentage violation as we slide Γ from 0 to 2. The data from the $SV_{max}$ column in Table 3 can be used as the correction factor to slow down the clock to avoid maxtime violations, a procedure also called frequency binning. Figure 2 shows the percentage improvement in clock period after frequency binning. The solutions marked as 'Normal LP' did not consider any protection in either data or clock for both the 0% and 2% data cases and, as can be seen, are the worst in performance after frequency binning. It is worth pointing out that the solutions based on 5% clock and 0% data uncertainties actually ended up with a lower performance gain, after frequency binning, compared to those based on 5% clock and 2% data uncertainties, thus highlighting the need for considering uncertainties in the design. Enabling RAD in the clock tree provides additional range for clock tuning since we move a level higher in the hierarchy. Table 2 and Table 4 can be compared to see the effect of enabling RAD.

Table 4: Percentage improvement in clock frequency, with RAD enabled and without adjustable delay quantization.

Table 5 shows the Monte Carlo simulations for the same scenario. The results followed a similar trend, in terms of varying performance improvement and robustness, as for the case when the CVDs at the RAD level were disabled.
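The Monte Carlo procedure described in this section can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' scripts: each arc uses a single data delay for both setup and hold, perturbations are mapped to slack with the sign convention that later data or an earlier capture clock consumes setup slack, and the argument layout and function name are hypothetical.

```python
import random


def monte_carlo_yield(arcs, clock, n_iter=1000, data_pct=0.02,
                      clock_pct=0.05, seed=0):
    """Estimate hold-violation percentage and worst setup violation.

    arcs: list of (i, j, delay, setup_slack, hold_slack) nominal values
    clock: nominal clock arrival time per register (hypothetical input)
    data_pct, clock_pct: 3-sigma deviations as a fraction of the mean
    Returns (violation_percent, sv_max).
    """
    rng = random.Random(seed)
    failing_iters = 0
    sv_max = 0.0
    for _ in range(n_iter):
        # One sampled chip: perturb every clock arrival time once.
        dclk = [rng.gauss(0.0, clock_pct * abs(c) / 3.0) for c in clock]
        hold_failed = False
        for i, j, delay, s_setup, s_hold in arcs:
            ddata = rng.gauss(0.0, data_pct * delay / 3.0)
            # Net lateness of data relative to the capture clock.
            shift = ddata + dclk[i] - dclk[j]
            sv_max = max(sv_max, shift - s_setup)  # setup margin exceeded
            if s_hold + shift < 0:                 # data arrives too early
                hold_failed = True
        failing_iters += hold_failed
    return 100.0 * failing_iters / n_iter, sv_max
```

Only hold (mintime) failures count towards the violation percentage, following the formula above, because a setup (maxtime) failure can be recovered by slowing the clock. The returned `sv_max` plays the role of the correction used for frequency binning.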
Experiment 2
Table 5: Monte Carlo simulations, with RAD enabled and without adjustable delay quantization.

The experiments described in this section provide solutions with quantized arrival times and thus are practically feasible. The LP formulations with quantization transform the problem into a mixed integer programming (MIP) problem. Due to the increased number of variables and constraints in the LP formulation to accommodate quantization, convergence became an issue during the experiments. The results presented in this section are a mix of the results from the conventional MIP approach and our modified randomized rounding approach. Whenever the conventional MIP solver (CPLEX) could not converge to a solution, modified randomized rounding was used to find the solution. In the modified randomized rounding approach, the corresponding relaxed (non-integer) solutions were used as a starting solution. A subset of the CVD arrival times from the non-integer solution that matched the quantization were hard coded as absolute values, and the integer constraints for them were eliminated. The decision regarding which quantized arrival times to hard code was based on a random number generator. This was done to provide more flexibility of choice for the arrival times. The best solution was tracked and updated after each iteration. Tables 6 and 7 show a decrease in percentage improvement in clock frequency and an increase in robustness going from Γ = 0 to 2.

Table 9: Monte Carlo simulations, with RAD and adjustable delay quantization.

Enabling RAD increases the number of variables and constraints, thus forcing us to use the modified randomized rounding for more cases and sacrificing optimization for run time. The run time for all formulations varied from a couple of minutes to at most a couple of hours.
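The modified randomized rounding described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the `objective` callback is a hypothetical stand-in for re-solving the reduced MIP, the remaining free variables are searched exhaustively instead of with a solver, and `fix_prob` is an assumed parameter.

```python
import random
from itertools import product


def randomized_rounding(relaxed, levels, objective, n_iter=20,
                        fix_prob=0.5, seed=0, tol=1e-9):
    """Fix a random subset of already-quantized values, search the rest.

    relaxed: arrival times from the relaxed (non-integer) solution
    levels: allowed quantized arrival times
    objective(x): returns M for a fully quantized schedule x, or None
    if x violates a constraint (hypothetical stand-in for the solver)
    """
    rng = random.Random(seed)
    # Arrival times in the relaxed solution that already match a level.
    on_grid = [k for k, v in enumerate(relaxed)
               if any(abs(v - q) <= tol for q in levels)]
    best = None
    for _ in range(n_iter):
        # Randomly choose which matching values to hard code this round.
        fixed = {k for k in on_grid if rng.random() < fix_prob}
        free = [k for k in range(len(relaxed)) if k not in fixed]
        for choice in product(levels, repeat=len(free)):
            x = list(relaxed)
            for k, q in zip(free, choice):
                x[k] = q
            m = objective(tuple(x))
            # Track the best feasible quantized schedule seen so far.
            if m is not None and (best is None or m > best[0]):
                best = (m, tuple(x))
    return best
```

Hard coding values removes integer variables from the problem, which is what shortens run time, at the cost of possibly excluding the true optimum. Repeating the random choice and keeping the best result recovers part of that loss.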
In general, the trend shows that the cases with 2% data protection show a greater reduction in the optimal solution after quantization, as they have constraints for both robustness and quantization that make the problem tightly constrained, reducing the range of values for the clock arrival times. The randomized rounding further sacrifices optimality as it hard codes certain values. The effects of the reduced performance gain can be seen by comparing Tables 4 and 8. The trends in performance gain obtained from our experimental results are in accordance with the theoretical predictions for cases where data uncertainty was assumed to be 0%, i.e., the performance gain for the normal ILP formulation is the same as that for the robust ILP formulation with Γ = 0. However, for cases where data uncertainty was assumed to be non-zero, the performance gains for the aforementioned formulations are not the same. This is due to the fact that for the robust formulation with Γ = 0, as shown in Equation 9, there is still a non-zero data protection on the right hand side of the equation. Such a protection does not exist for the normal formulation in Equation 2.
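To make the last point concrete, the setup constraints of the two formulations can be compared under an assumed form. The sign convention (register $i$ launching, register $j$ capturing) is an assumption for illustration rather than a reproduction of Equations 2 and 9. With Γ = 0 the clock protection term vanishes, but the data protection $\Delta \cdot delay$ remains:

```latex
\begin{align*}
\text{normal:} \quad
  & x_i - x_j + \mathit{coeff} \cdot M \le \mathit{slack}_{max} \\
\text{robust, } \Gamma = 0\text{:} \quad
  & x_i - x_j + \mathit{coeff} \cdot M \le \mathit{slack}_{max} - \Delta \cdot \mathit{delay}
\end{align*}
```

The two right-hand sides coincide only when $\Delta = 0$, which corresponds to the 0% data uncertainty cases in which both formulations give the same performance gain.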
CONCLUSIONS
The uncertainty due to process and environmental variations can be of significant concern for chip yield. The need for risk assessment and manipulation is paramount for future designs beyond the 90 nm generation. Clock skew has been the bottleneck for improving system performance, and useful skew has been proposed in the past as a way to improve clock frequency. We present a novel approach which accounts for variations in both the clock arrival times and the delay through the combinational logic. The proposed approach allows designers to formulate problems with varying levels of robustness. One important observation from this work is that formulations for clock skew optimization that consider fewer sources of uncertainty in DSM technologies can result in lower timing robustness and thus lower performance gain, compared to formulations that account for all sources of variation. The proposed method provides a single lever (Γ) to allow an easy trade-off between robustness and performance.
ACKNOWLEDGMENT
The research presented in this paper was partially funded by INTEL through an INTEL Research Fellowship. The generous support of INTEL during this project is greatly appreciated.
