Manufacturing process variability impacts the performance of synchronous logic circuits by means of its effect on both clock network and functional block delays. Typically, variability in clock networks is either handled early in the design flow by assigning margins to clock network delays, or at a later stage through post-processing steps that only focus on achieving minimal skew, without regard to functional block variability. In this work, we present a technique that alters clock network lines so that the circuit meets its timing constraints at all process corners. This is done near the end of the design flow while considering delay variability in both the clock network and the functional blocks. Our method operates at the physical level and provides designers with the required changes in clock network line widths and/or lengths. This can be formulated as a Linear Programming (LP) problem, and thus can be solved efficiently. Empirical results for a set of ISCAS-89 benchmark circuits show that our approach can considerably reduce the effect of process variations on circuit performance.
INTRODUCTION
The performance of a synchronous logic circuit depends on both circuit delays and clock skews introduced by the clock distribution network. Clock skew is defined as the difference between the arrival times of the clock signal at different clocked circuit elements. Typically, clock networks are designed to minimize skew between the various clocked elements, with the objective of achieving a zero-skew clock tree, as in [1] and [2]. On the other hand, the more aggressive techniques of skew optimization or clock scheduling (e.g. [3], [4], and [5]) deliberately assign skews to different clocked elements in order to obtain better performance. These techniques usually formulate optimization problems to find clock network path delays that improve circuit performance. However, in practice, this requires interaction among different stages of the design flow [6] and extends beyond finding clock network delay values.
With the scaling of VLSI technology, the effect of manufacturing process variations on circuit and clock network delays has increased. Typically the effect of this on clock networks is seen in the form of unintended skew, which degrades circuit performance. In the case of zero-skew trees the focus has been on minimizing the maximum skew resulting from process variations. This is typically done by wire sizing and/or introduction of buffers (e.g. [7] and [8]). Such approaches try to "cap" the maximum skew expected as a result of process variability. However, these approaches deal with variability in the clock network without regard to that in the functional blocks, and are focused on achieving zero skew rather than timing closure. On the other hand, skew optimization techniques usually deal with variability by incorporating margins [3], or permissible ranges [5], into their optimization formulation. However, this is done at an early stage in the design flow, when accurate variability information for the clock network and functional blocks is typically not available, and is thus problematic.
* This project was supported in part by Intel Corp.
In this work we present a linear programming (LP) formulation in order to reduce the effect of process variability on circuit performance. Our approach specifies required changes at the physical level in clock network wire widths or lengths rather than required clock network delay values. It is applied near the end of the design flow as a post-processing step to account for process variability when accurate variational clock network and functional path delays are available. As stated earlier, modifying clock line widths or lengths as a post-processing step has been proposed to minimize the unintended skew of zero-skew clock trees. However, in this work we formulate an LP that varies skews to try to meet circuit timing constraints at all process corners rather than try to eliminate skew.
PRELIMINARIES
In this section, we first present the process parameter model assumed in our work. Then, we present some definitions and preliminary concepts that are used throughout the paper.
Variability Model
In our approach, all logic cell and interconnect delays are modeled as linear functions of normalized process parameters, whose values vary between −1 and +1. Hence the delay of a logic cell or of an interconnect RC-tree (also referred to as an interconnect structure) can be written as an affine function of these process parameters. Because the delays of individual paths, in both clock networks and functional blocks, are sums of gate and interconnect delays, they also become such affine functions of the process parameters. Let the number of process parameters under consideration be $p$. Thus, a timing quantity $t$, representing a timing arc, interconnect, or path delay, can be written as follows:

$$t = \bar{t} + \sum_{l=1}^{p} \delta_l X_l \qquad (1)$$

where $X = (X_1, \ldots, X_p)$ is the set of normalized process parameters, $\bar{t}$ is the nominal value of $t$, and $\delta_l$ is its sensitivity to process parameter $X_l$. Such an affine function represents a hyperplane in $(p + 1)$-dimensional space, and will often be referred to as, simply, a hyperplane. In this work, it is often required to find the maximum or the minimum value of a hyperplane over the process parameter space. It is trivial to see that the maximum value of $t$ over all process corners is:

$$\max_{X} t = \bar{t} + \sum_{l=1}^{p} |\delta_l| \qquad (2)$$

which is equivalent to computing $t$ at the process corner $\hat{X} = (\hat{X}_1, \ldots, \hat{X}_p)$, such that $\hat{X}_l = +1$ if $\delta_l \ge 0$ and $\hat{X}_l = -1$ otherwise. Minimizing $t$ involves a very similar operation where the hyperplane is instead computed at $\hat{x} = (\hat{x}_1, \ldots, \hat{x}_p)$, such that $\hat{x}_l = -1$ if $\delta_l \ge 0$ and $\hat{x}_l = +1$ otherwise.
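The corner selection described above can be sketched in a few lines of Python. This is only an illustration: the function names and the list-based representation of the sensitivities are assumptions made here, not part of the original method.

```python
from typing import List, Tuple


def hyperplane_max(nominal: float, sens: List[float]) -> Tuple[float, List[int]]:
    """Maximum of nominal + sum(sens[l] * X[l]) over X in [-1, +1]^p.

    The maximizing corner sets X[l] = +1 where the sensitivity is
    non-negative and X[l] = -1 otherwise.
    """
    corner = [1 if s >= 0 else -1 for s in sens]
    value = nominal + sum(abs(s) for s in sens)
    return value, corner


def hyperplane_min(nominal: float, sens: List[float]) -> Tuple[float, List[int]]:
    """Minimum of the same affine function; the corner is the mirror image."""
    corner = [-1 if s >= 0 else 1 for s in sens]
    value = nominal - sum(abs(s) for s in sens)
    return value, corner
```

Both functions run in time linear in the number of process parameters, which is why verification at the "worst-case" corner is inexpensive compared with enumerating all 2^p corners.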
Circuit Model
We focus on synchronous sequential circuits with edge-triggered registers, but the work can be extended to circuits with level-sensitive latches. In such circuits, every combinational logic block receives its inputs from its input registers and stores its outputs in its output registers. We denote by $S$ the set of combinational blocks in a given circuit. For example, Fig. 1 shows a simple sequential circuit, where in this case $S = \{s_1, s_2\}$. A clock signal is connected to all the registers of a sequential circuit by means of a clock network that consists of buffers and interconnect. Such a clock network can be designed to minimize skew between the clock signals at the various registers, or it can be designed such that it intentionally introduces skew at carefully chosen registers in order to help meet timing constraints. In our approach, we refer to the clock signal received at a register as the register's clock phase, and we call the delay of the path from the clock source to the register its clock phase arrival time. In general, for a sequential circuit with $m$ registers, the clock phase arrival times are referred to as $r_q$, $1 \le q \le m$. For a combinational logic block $s \in S$, $D^{ij}_s$ refers to the largest delay between the input register controlled by clock phase $i$ and the output register controlled by clock phase $j$. The term $d^{ij}_s$ is used to refer to the smallest such delay. For example, in Fig. 1, $D^{24}_1$ is the largest delay between reg 2 and reg 4 through $s_1$, whereas $d^{24}_1$ is the smallest such delay. The clock phase arrival times $r_q$ of a sequential circuit are path delays and hence, in our formulation, are hyperplanes in the set of process parameters $X$. Consider the set $P^{ij}_s$ of all paths through the combinational logic block $s$ between the source register controlled by clock phase $i$ and the sink register controlled by clock phase $j$.
The delays of these paths are, each, a hyperplane in the process parameters, for which $D^{ij}_s$ is the maximum and $d^{ij}_s$ is the minimum. Each of the two is therefore a piecewise planar surface over the process space, and there is prior work (e.g. [9] and [10]) on how these can be found and represented. Typically, such a surface is represented by a set or a "list" of hyperplanes, each of which corresponds to a particular path. Thus $D^{ij}_s$ is the set of hyperplane delays of "potentially longest paths" and $d^{ij}_s$ is the set of hyperplane delays of "potentially shortest paths". However, in our method, it is also possible to approximate $D^{ij}_s$ and $d^{ij}_s$ by single hyperplanes, obtained by finding hyperplane bounds for these two sets, using the approach in [11]. That is, $D^{ij}_s$ would be replaced by a hyperplane approximation of the "max" of the delays of the "potentially longest paths", and similarly $d^{ij}_s$ would be replaced by a hyperplane approximation of the "min" of the "potentially shortest paths" delays. These approximations are conservative, i.e., they never underestimate the true maximum delay or overestimate the true minimum delay for any point in the process space.
BACKGROUND
In our approach, we modify the delays of the clock network in order for the circuit to meet its timing constraints in the presence of process variations. This would be applicable near the end of the design flow, as part of physical design while trying to achieve timing sign-off, by modifying the clock network line widths and/or lengths. In this section, we first present a short overview of standard clock tree synthesis and a brief description of how clock skew optimization could be used to improve the performance of a circuit in a typical design flow. We then discuss the effect of process variability on circuit performance and how a circuit's timing constraints can be verified under variability. This will provide some background understanding of the problem and set the stage for the presentation of our proposed clock network optimization method.
Clock Tree Synthesis
In order to synthesize a clock network, a user can specify certain parameters such as maximum skew, maximum and minimum clock phase arrival times, and a host of other parameters. These are then used to generate a clock network consisting of buffers and interconnect, where the objective would be to have a "balanced" clock tree. That is, designers usually strive to minimize skews between different leaves of the clock tree. In order to achieve circuits with higher frequencies, or higher slacks, a user can use skew optimization techniques such as those presented in [3] and [4] to find an "optimal" set of clock phase arrival times. These techniques lead to a set of nominal clock phase arrival times, or a clock schedule, which is then translated into a physical structure consisting of buffers and interconnect. The method in [3] is an LP formulation that provides the clock phase delays that maximize the clock frequency for circuits with edge-triggered registers, as follows. Let $\bar{r}_q$, $1 \le q \le m$, be the nominal clock phase arrival times, and let $\bar{D}^{ij}_s$ and $\bar{d}^{ij}_s$ be, respectively, the largest and smallest nominal delays through the combinational logic block $s$ between the source register with clock phase $i$ and the sink register with clock phase $j$. The minimum acceptable clock period $T$ can be found using the following LP [3]:
To allow this general formulation, the convention is adopted that, if logic block $s$ does not contain a combinational path between registers controlled by clock phases $i$ and $j$, then $\bar{D}^{ij}_s = -\infty$ and $\bar{d}^{ij}_s = +\infty$. For a required clock period of $T_c$, if $T \le T_c$ then the clock network which achieves the values of $\bar{r}_q$, $1 \le q \le m$, is generated, placed, and routed. However, achieving a clock network that can produce such arrival times is more complex than achieving a minimal skew tree, and clock skew optimization remains an active research area [6]. In any case, our method can be used irrespective of whether clock skew optimization is used to generate the clock network or not.
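As an illustration of nominal clock scheduling, the sketch below finds a minimum clock period without an LP solver. It swaps in a different, well-known technique: for a fixed period the setup and hold constraints are difference constraints on the arrival times, so feasibility can be checked with Bellman-Ford, and the period can then be found by bisection. The constraint arrangement (setup: r_i + D + t_setup ≤ r_j + T; hold: r_i + d ≥ r_j + t_hold), the function names, and the caller-supplied upper bound `t_hi` are assumptions made for this example.

```python
from typing import List, Optional, Tuple

# (source register i, sink register j, largest delay D, smallest delay d)
Path = Tuple[int, int, float, float]


def schedule_for_period(T: float, m: int, paths: List[Path],
                        t_setup: float = 0.0,
                        t_hold: float = 0.0) -> Optional[List[float]]:
    """Return arrival times meeting all constraints for period T, or None."""
    edges = []  # edge (u, v, w) encodes r_v - r_u <= w
    for i, j, D, d in paths:
        edges.append((j, i, T - D - t_setup))  # setup: r_i - r_j <= T - D - t_setup
        edges.append((i, j, d - t_hold))       # hold:  r_j - r_i <= d - t_hold
    dist = [0.0] * m  # equivalent to a virtual source with zero-weight edges
    for _ in range(m):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist  # converged: dist is a feasible schedule
    return None  # still relaxing after m passes: negative cycle, infeasible


def min_period(m: int, paths: List[Path], t_hi: float,
               t_setup: float = 0.0, t_hold: float = 0.0,
               iters: int = 60) -> Optional[float]:
    """Bisect on T in [0, t_hi]; returns None if even t_hi is infeasible."""
    if schedule_for_period(t_hi, m, paths, t_setup, t_hold) is None:
        return None
    lo, hi = 0.0, t_hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if schedule_for_period(mid, m, paths, t_setup, t_hold) is not None:
            hi = mid
        else:
            lo = mid
    return hi
```

For two registers with a path 0→1 (D = 10, d = 2) and a path 1→0 (D = 4, d = 1), a zero-skew tree needs a period of 10, whereas the schedule found here allows a shorter period because skew is borrowed up to the hold limit.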
Verification for Variability
Using the layout of the circuit resulting from either the standard or skew optimization nominal point analyses as a starting point, along with characterized cell and interconnect process sensitivities, variational path delays of the circuit can now be extracted. For a register with nominal clock phase arrival time $\bar{r}_q$, its actual arrival time $r_q$ under variability is a hyperplane, given by:
The dependence of combinational path delays and clock phase arrival times on process parameters may cause some timing constraints to fail for some process settings. In order for the circuit to meet the timing constraints under variability, then, for every logic block s, the following inequalities should be satisfied over the whole process parameter space for all clock phases i and j:
which is equivalent to saying that:
Note that here we make the simplifying assumption that t j setup and t j hold are constant, rather than dependent on process parameters. Therefore, these are not included in the max and min expressions. However this assumption is by no means necessary, and doesn't affect the applicability of our approach. Performing the verification in (6) 
Writing r i and r j as in (4), we can now write (6) as:
Note that the expressions inside the "max" and the "min" operations are hyperplanes. Finding the maximum or minimum value of a hyperplane over the process space is a straightforward operation that is performed by computing the value of the hyperplane at a carefully chosen corner, as described earlier in (2). For the constraints in (8), we use the term "worst-case" corner to refer to the process corner used in the verification of the constraint. Thus for a setup constraint the "worst-case" corner is the process corner used to maximize its hyperplane, whereas for a hold-time constraint this term refers to the process corner that minimizes its hyperplane. Now consider the case when $D^{ij}_s$ is a piecewise planar surface represented by a list of hyperplanes $D^{(1)}, \ldots, D^{(u)}$, so that:

$$D^{ij}_s = \max\left(D^{(1)}, \ldots, D^{(u)}\right) \qquad (9)$$
Then, in order to verify the setup constraint of D ij s we have to verify the following constraints:
These constraints are verified by writing each at its "worst-case" corner, as is done for the setup constraint of (8). However, note that the "worst-case" corners for these constraints can be different. Let $T_v$ be the smallest clock period for which all the setup constraints of a circuit are satisfied under process variability. This value can be found by evaluating the left-hand side expressions of all setup constraints, such as (10), and choosing the largest value. We call $M_v = T_v - T_c$ the required timing margin; this is the margin that has to be added to $T_c$ so that the circuit can operate correctly in the presence of process variability. The same approach is taken for $d^{ij}_s$, where its hold-time constraint is also written for each of the hyperplanes that form its piecewise planar surface as in (10) and may be verified in a similar manner.
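The computation of the required timing margin can be sketched as follows. The example assumes that each setup constraint has the form max over X of (D + r_i − r_j) + t_setup ≤ T, with a constant setup time, and that every delay is stored as a pair (nominal value, list of sensitivities). The names `combine`, `worst_case_max`, and `required_margin` are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Tuple

Affine = Tuple[float, List[float]]  # (nominal value, sensitivities to X_1..X_p)


def combine(terms: List[Tuple[float, Affine]]) -> Affine:
    """Return the affine form sum(c * a) over the (c, a) pairs in terms."""
    p = len(terms[0][1][1])
    nominal = sum(c * a[0] for c, a in terms)
    sens = [sum(c * a[1][l] for c, a in terms) for l in range(p)]
    return nominal, sens


def worst_case_max(a: Affine) -> float:
    """Largest value of an affine form over all process corners."""
    return a[0] + sum(abs(s) for s in a[1])


def required_margin(r: Dict[int, Affine],
                    setup_paths: List[Tuple[int, int, Affine]],
                    t_setup: float, Tc: float) -> Tuple[float, float]:
    """Return (Tv, Mv): smallest workable period and the margin Tv - Tc.

    setup_paths holds one (i, j, D) entry per hyperplane of each
    piecewise planar longest-path delay.
    """
    Tv = max(
        worst_case_max(combine([(1.0, D), (1.0, r[i]), (-1.0, r[j])])) + t_setup
        for i, j, D in setup_paths
    )
    return Tv, Tv - Tc
```

Because each hyperplane is maximized at its own corner, the constraint that determines Tv may sit at a different process corner than the others.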
OPTIMIZATION
If the verification process described above reveals that the circuit does not meet its timing constraints for a required Tc under all process settings, then an optimization problem can be formulated to modify the clock network so that the required timing margin is minimized, thus possibly allowing the timing constraints to be satisfied at all process corners, as follows.
A set of physical parameters of the clock tree, such as wire lengths and/or widths, are chosen as variables to be optimized. By varying these optimization variables, the different clock phase delays can be modified to minimize the required timing margin. However, before we explain how this is done, we will first examine how varying the physical parameters of an interconnect RC-structure of a clock tree affects its clock phase arrival times.
Physical Parameter to Clock Phase Delay
Consider Fig. 2 , which shows a generic segment of the path between a clock source and a register, consisting of the interconnect RC-structure con k , its "input buffer" buf k , and one of its "output buffers" buf k+1 . We are interested in the effect that varying a physical parameter of con k would have on the clock phase arrival time r seen in Fig. 2 . We will use variation in wire width as the physical parameter of interest in our discussion, however all of the arguments that follow can be made for a variation in wire length as well. Let w k be the wire width of con k , and Δw k be a change in w k . A variation in w k would have an immediate effect on the delays of the three "local" circuit elements, con k , buf k , and buf k+1 . First, the introduction of Δw k would vary the delay of con k by Δdcon k . Second, the effective load capacitance seen by buf k would vary as a result of Δw k , thus leading to a variation Δd buf k in its delay, d buf k . In addition to the variation in the delay of buf k , the change in its effective load capacitance also leads to a variation in its output signal slew. This in turn varies the delay of buf k+1 by Δd buf k+1 . One could argue that this variation in output signal slew of buf k would also lead to a similar effect in buf k+1 , which would then lead to the same effect in buffers further downstream. This would extend the effect of Δw k to "non-local" circuit elements. However, the impact on the delays of these elements is small in practice, and it can be safely ignored. Thus, the delay variations in the three "local" circuit elements seen in Fig. 2 would account for most of the variation seen in r. This variation, Δr, can now be written as:
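In our approach, a linear model is used to approximate the dependence of delay variations of circuit elements on $\Delta w_k$. Thus, a first-order Taylor series approximation of $d_{buf_k}$, $d_{con_k}$, and $d_{buf_{k+1}}$ around the original value of $w_k$ is used to write each delay variation as the derivative of the corresponding delay with respect to $w_k$, multiplied by $\Delta w_k$. We then write $r$ as in (4), and $d_{buf_k}$, $d_{con_k}$, and $d_{buf_{k+1}}$ as follows: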
Then, using (4), we write:
and, using (14) and (13), we deduce that:
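The chain of reasoning in this subsection can be summarized by the following sketch. It assumes that each of the three local delays is affine in the process parameters, with nominal value and sensitivities that are differentiable in $w_k$. The symbols $a_k$ and $b_{k,l}$ are introduced here purely for illustration and are not taken from the original equations.

```latex
\begin{align*}
\Delta r &\approx \Delta d_{buf_k} + \Delta d_{con_k} + \Delta d_{buf_{k+1}} \\
         &\approx \left( \frac{\partial d_{buf_k}}{\partial w_k}
                       + \frac{\partial d_{con_k}}{\partial w_k}
                       + \frac{\partial d_{buf_{k+1}}}{\partial w_k} \right) \Delta w_k \\
         &= \left( a_k + \sum_{l=1}^{p} b_{k,l} X_l \right) \Delta w_k
\end{align*}
```

Here $a_k$ collects the derivatives of the three nominal delays with respect to $w_k$, and $b_{k,l}$ collects the derivatives of their sensitivities to $X_l$. The change in arrival time is therefore linear in $\Delta w_k$ for any fixed process corner, which is the property that keeps the later optimization linear.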
Modeling Multiple Variable Parameters
So far, we have presented an analysis of the effect of varying a single physical parameter on clock phase arrival time. We now describe how one can capture the effect of varying multiple physical parameters on clock phase arrival times. Without loss of generality, we choose to vary only the widths of the various interconnect structures of the clock network. Let W represent the vector of interconnect widths of the clock tree, and let ΔW be the vector of variations in these widths. Let the number of interconnect segments in the clock tree be n, and we write ΔW = (Δw 1 , Δw 2 , . . . , Δwn), where each component represents the variation in an interconnect segment in the clock tree. For an arbitrary clock phase arrival time r, the introduction of ΔW leads to a change Δr:
Because Δr is the aggregate result of n "local" delay variations in the clock tree, the result of (13) is generalized to write:
Using (18) and (15) we can write:
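A minimal sketch of how several width changes combine into one change in arrival time is given below. It assumes each segment k contributes (a_k + Σ_l b_kl X_l)·Δw_k, with segments that do not affect the register carrying zero coefficients. The coefficient names and the function `delta_arrival` are assumptions made for this example.

```python
from typing import List, Tuple

# Per-segment gain: (a_k, [b_k1, ..., b_kp]), i.e. the change in the nominal
# arrival time and in each process sensitivity per unit change in width.
Gain = Tuple[float, List[float]]


def delta_arrival(gains: List[Gain], dW: List[float]) -> Tuple[float, List[float]]:
    """Return the affine form of delta-r as (nominal shift, sensitivity shifts)."""
    p = len(gains[0][1])
    nominal = sum(a * dw for (a, _), dw in zip(gains, dW))
    sens = [sum(b[l] * dw for (_, b), dw in zip(gains, dW)) for l in range(p)]
    return nominal, sens
```

The result is itself an affine function of the process parameters, so it can be added to the extracted arrival time hyperplane directly.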
Formulation of the Optimization Problem
The aim of the optimization step is to determine the required changes in interconnect wire widths, ΔW = (Δw 1 , Δw 2 , . . . , Δwn), given allowable bounds ΔW min ≤ ΔW ≤ ΔWmax, so that the required timing margin, M , is minimized, given a required clock period Tc. One possible way to achieve this is to extend the work in [3] by finding an optimal assignment ΔW such that it minimizes M = T − Tc, where T is the clock period for which the timing constraints are met under all process corners. This can be expressed as follows:
Note that r i , r j , D ij s , and d ij s are extracted affine functions that depend solely on the process parameters X l , 1 ≤ l ≤ p, whereas the variations in clock phase arrival times, Δr i and Δr j , depend on both Δw k and X l , as shown in (19), and can be written as:
If the minimum value, $M$, of $T - T_c$ is such that $M \le 0$, then the timing constraints of the circuit can be met in the presence of process variability. Otherwise, the constraints simply cannot be satisfied given both the required clock period $T_c$ and the allowable ranges on $\Delta W$. In that case, achieving sign-off will require different changes in the design, beyond the adjustments to interconnect wire widths that we propose here. In any case, $M$ corresponds to the smallest required timing margin achieved for $\Delta W \in [\Delta W_{min}, \Delta W_{max}]$, in order to meet timing constraints at all process corners.
Let $\hat{X}^{ij}_s = (\hat{X}_1, \ldots, \hat{X}_p)$ be the process corner that maximizes the expression in (24). Note that this is the pre-optimization "worst-case" corner of (23), i.e., when $\Delta r_i = 0$ and $\Delta r_j = 0$. This corner is used to write (23) as:
where writing $D^{ij}_s + r_i - r_j$ at this corner gives a constant value $K$:
and using (21), writing $\Delta r_i$ at $\hat{X}^{ij}_s$ gives:
and similarly, $\Delta r_j$ can be written as:
Thus writing (23) at $\hat{X}^{ij}_s$ gives the following constraint:
Note that this constraint is now linear in the optimization variables Δw k , 1 ≤ k ≤ n, and no longer has a "max" expression over the process parameters. The same transformation can be done for the rest of the "max" and "min" operations in all of the setup and hold time constraints, respectively, to give the following optimization problem:
All the constraints in (30) are now linear in the optimization variables, so that (30) is an LP which can be solved efficiently using commercial LP solvers. Next, consider the case when $D^{ij}_s$ is a piecewise planar surface represented as in (9). In that case, a constraint similar to (29) is written for each of $D^{(1)}, \ldots, D^{(u)}$. This would result in an LP that is very similar to (30), but which has a larger number of constraints. However, since actual path delays are now used in the optimization rather than conservative estimates, this would result in a less pessimistic and more accurate minimum value of the required margin $T - T_c$.
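The transformation of a single setup constraint into a linear constraint can be sketched as follows. The example assumes that t = D + r_i − r_j is given as (nominal, sensitivities) and that each clock phase has per-segment coefficients (a_k, [b_k1..b_kp]) describing how a width change alters its arrival time. These names and the function `linearize_setup` are hypothetical.

```python
from typing import List, Tuple

Affine = Tuple[float, List[float]]
Gain = Tuple[float, List[float]]  # (a_k, [b_k1, ..., b_kp]) for one segment


def linearize_setup(t: Affine, gains_i: List[Gain],
                    gains_j: List[Gain]) -> Tuple[float, List[float], List[int]]:
    """Write max_X(t + delta_r_i - delta_r_j) at the pre-optimization corner.

    Returns (K, coeffs, corner) so that the constraint becomes
    K + sum(coeffs[k] * dw[k]) <= T - t_setup.
    """
    nominal, sens = t
    # Pre-optimization worst-case corner: the sign pattern of t's sensitivities.
    corner = [1 if s >= 0 else -1 for s in sens]
    K = nominal + sum(s * x for s, x in zip(sens, corner))

    def at_corner(g: Gain) -> float:
        a, b = g
        return a + sum(bl * x for bl, x in zip(b, corner))

    coeffs = [at_corner(gi) - at_corner(gj) for gi, gj in zip(gains_i, gains_j)]
    return K, coeffs, corner
```

Each returned row can be handed to any LP solver together with the bounds on the width changes. The corner is fixed before optimization, which is exactly why a separate check for changing corners is needed afterwards.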
In its present form, (30) performs an optimization where each constraint is written at its own pre-optimization "worst-case" corner, which was found during the verification step. For a particular constraint, say (23), transforming it to (25) and using the latter in (30) is meant to ensure that the post-optimization circuit would satisfy (23) for those process settings Xv such that:
However, there is no guarantee that $\hat{X}^{ij}_s$ would be the "worst-case" corner for the post-optimization expression $(D^{ij}_s + r_i + \Delta r_i - r_j - \Delta r_j)$. The reason for this is that the introduction of $\Delta r_i$ and $\Delta r_j$ alters the process parameter sensitivities of the expression inside the "max" operation in (23), as can be seen from (21) and (22). Thus, in order for (30) to ensure that the timing margin found allows the circuit to pass timing at all process corners, we must make sure that possible changes in "worst-case" corners are accounted for in (30). In the next section, we propose a method that deals with this possibility, in a conservative fashion, while preserving the linearity of the optimization problem.
COVERING ALL CORNERS
In this section, we present a method that can be used to determine whether the post-optimization "worst-case" corner of a timing constraint in (20) could possibly be different from its pre-optimization "worst-case" corner. After that, we present our proposed steps to modify (30) if a constraint is found to be as such. The aim of this modification is to ensure that the solution of (30) guarantees that the circuit would pass its timing constraints for all process corners.
Possible Changes in Worst-Case Corners
Recall from Section 3.2 that the process corner that maximizes or minimizes an affine expression in the space of process parameters can be found by looking at the signs of its sensitivities to each of these parameters. Therefore, in order to determine whether the "worst-case" corner of an expression might change post-optimization, we need to find the sensitivities whose signs might change as a result of the optimization. One way of doing this is as follows. Consider the setup constraint in (23), let $t = D^{ij}_s + r_i - r_j$ as in (24), and let $t'$ be as follows:

$$t' = D^{ij}_s + r_i + \Delta r_i - r_j - \Delta r_j \qquad (33)$$

For the given bounds on $\Delta W$, $\Delta W_{min}$ and $\Delta W_{max}$, we want to determine whether the sensitivity to a process parameter $X_l$, $1 \le l \le p$, in $t'$ could have a sign opposite to its sensitivity in $t$. Using (21) and (22) we can write the sensitivity to $X_l$ in $t'$, denoted by $t'_l$, as follows:
where $t_l$ is the sensitivity of $t$ to parameter $X_l$, which is known from (26). In addition to determining whether $t_l$ and $t'_l$ are guaranteed to have the same sign or not, we also want to determine the largest absolute value that $t'_l$ can assume, if it is possible for it to have a sign different from that of $t_l$. This is not necessarily the largest magnitude that $t'_l$ can have when it has a sign opposite to that of $t_l$, but rather the largest absolute value it can have, irrespective of sign. Assume, without loss of generality, that $t_l$ is negative. Determining whether $t'_l$ can be positive is done as follows: each $\Delta w_k$ is assigned whichever of its two bounds makes its contribution to $t'_l$ as large as possible.
The value of $t'_l$ is then computed using these assigned values. If the computed $t'_l$ is positive then it is "at risk" of a sign change, and this value is the largest positive value it can assume. In this case, the smallest (most negative) value of $t'_l$ is also computed in a similar way and is compared to its largest positive value to find its maximum absolute value, which we call $t^{max}_l$.
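The sign check can be sketched as an interval computation over the box of allowed width changes. The example assumes the post-optimization sensitivity has the form t_l + Σ_k c_k·Δw_k, where c_k is the net per-segment coefficient for parameter X_l. The coefficient list and both function names are assumptions made for this illustration.

```python
from typing import List, Tuple


def sensitivity_range(t_l: float, coeffs: List[float],
                      dw_min: List[float], dw_max: List[float]) -> Tuple[float, float]:
    """Smallest and largest values of t_l + sum(c_k * dw_k) over the box."""
    hi = t_l + sum(c * (u if c >= 0 else lo)
                   for c, lo, u in zip(coeffs, dw_min, dw_max))
    lo_val = t_l + sum(c * (lo if c >= 0 else u)
                       for c, lo, u in zip(coeffs, dw_min, dw_max))
    return lo_val, hi


def sign_risk(t_l: float, coeffs: List[float],
              dw_min: List[float], dw_max: List[float]) -> Tuple[bool, float]:
    """Return (at_risk, t_max): whether the sign can flip, and the largest
    absolute value the post-optimization sensitivity can take."""
    lo_val, hi = sensitivity_range(t_l, coeffs, dw_min, dw_max)
    at_risk = hi > 0 if t_l < 0 else lo_val < 0
    return at_risk, max(abs(lo_val), abs(hi))
```

Since the expression is linear in the width changes, its extremes over a box are attained by pushing every variable to one of its bounds, so no search is needed.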
Accounting For Changing Corners
For a timing constraint in (20), once all the sensitivities that are "at risk" of having different post- and pre-optimization signs have been determined, we can proceed to modify (30), as follows. Let the constraint under consideration be the setup constraint shown in (23), and let $t$ and $t'$ be as shown in (24) and (33), respectively. Assume, without loss of generality, that for process parameters $X_1, \ldots, X_c$, their sensitivities in $t'$ were found to always have the same signs as their sensitivities in $t$, while the signs of the sensitivities of process parameters $X_{c+1}, \ldots, X_p$ were found to be possibly different in $t'$ than in $t$.
Let the pre-optimization "worst-case" corner of (23) beX 
Note that the constraint in (35) is more conservative than the constraint written at $\hat{X}^{ij}_s$, and hence the latter can be removed from (30). For a hold-time constraint, a bound $B_h$ is also computed, with the difference that the bound is added to the right-hand side of the inequality, rather than subtracted from it, to write:

Such constraints would ensure that all "potentially worst-case" corners are covered by using the bounds $B_s$ and $B_h$ to tighten the timing constraints of (30). Of course, this comes at the price of introducing some pessimism in the LP.
RESULTS
In order to test our approach, we have selected a set of circuits from the ISCAS-89 benchmark suite. These circuits were synthesized and mapped to a 90nm CMOS library, and then placed and routed using available commercial tools. The HSPICE netlists for the logic blocks and clock trees of these circuits were then extracted. Nominal delays and slews were characterized for all cells in the library, and a set of 10 parameters X 1 , . . . , X 10 was selected to model process variations. The ranges of those parameters were chosen such that their combined effect on the delay or slew of a cell or an interconnect resistance or capacitance value is 15%. The sensitivities of gate delays and interconnect RC-trees to these process parameters were randomly generated, in order to provide difficult test-cases. With the above variational models, the HSPICE netlists of the different test circuits were fed into our STA timing engine, which was implemented in C++ and extended to handle parameterized static timing analysis (PSTA).
Two PSTA flows were implemented, and consequently two sets of experiments were compared. The first PSTA flow, based on [9], provides an exact analysis as it propagates delays in the circuit using piecewise planar delay models, where all "potentially critical path" delays are preserved at every node. Hence, $D^{ij}_s$ and $d^{ij}_s$ are obtained as piecewise planar surfaces. The second PSTA flow is conservative and approximates $D^{ij}_s$ and $d^{ij}_s$ by single bounding hyperplanes, using the approach in [11].

Using the propagated delays from the exact PSTA, we start by computing the pre-optimization required timing margin, $M_v$, which guarantees that all constraints are met at their "worst-case" corners. $M_v$ can be easily computed using our verification method in Section 3.2 by maximizing the left-hand side of every setup constraint written as in (10), and recording the largest value achieved over all constraints to find the smallest clock period $T_v$ that will allow correct operation. The required clock period $T_c$ is then subtracted from $T_v$ to find $M_v$. We then ran our LP formulation of (30) to find the minimum achievable margin $M$. In our LP, the optimization variables are chosen to be variations in interconnect wire widths of the clock tree, $\Delta w_k$, $1 \le k \le n$, where $n$ is the number of interconnect structures in the clock tree. The bounds on variations in wire widths were set such that $0 \le \Delta w_k \le w_k$, where $w_k$ is the pre-optimization wire width. In other words, wire widths are allowed, at most, to double as a result of the optimization.
The minimum post-optimization timing margins, $M_{exact}$ and $M_{cons}$, achieved based on the exact and conservative PSTA flows, respectively, are shown in Table 1, in addition to the pre-optimization margin $M_v$. Given the above bounds on the allowable wire changes $\Delta w_k$, we observed improvements ranging from 50% to 136%. This shows a considerable reduction in the required margin, which allows the timing constraints of the circuit to be satisfied for all process corners. Our results also show that although using piecewise planar values guarantees a better post-optimization margin, for most circuits the difference between the improvements achieved in the two approaches is small. We also recorded and compared the runtimes of the optimization for both the exact and conservative approaches. The results in Table 2 show that although the hyperplane approach is more conservative in terms of post-optimization margin, it typically produces a speed-up between 1.16× and 2.5×. Thus, we see that one is faced with a runtime-accuracy tradeoff, where a good speed-up is achieved at the expense of some pessimism that is introduced to the analysis.
CONCLUSION
In this work, we presented a technique that modifies clock network wire widths and/or lengths so that a circuit meets its timing constraints at all process corners. Our method considers delay variability in both the clock network and the functional blocks, and is applicable near the end of the design flow. We showed that the problem of finding the required changes in line widths and/or lengths can be formulated as a Linear Program, which can be solved efficiently. Using our clock skew tuning approach, designers can considerably reduce the effect of process variations on circuit performance, as shown in our results.
