In this paper, we analyze the impact of process variations on the clock skew of VLSI circuits designed in deep submicrometer technologies. With smaller feature size, the utilization of a dense buffering scheme has been proposed in order to realize efficient and noise-immune clock distribution networks. However, the local variance of MOSFET electrical parameters, such as and DSS , increases with scaling of device dimensions, thus causing large intradie variability of the timing properties of clock buffers. As a consequence, we expect process variations to be a significant source of clock skew in deep submicrometer technologies. In order to accurately verify this hypothesis, we applied advanced statistical simulation techniques and accurate mismatch measurement data in order to thoroughly characterize the impact of intradie variations on industrial clock distribution networks. The comparison with Monte Carlo simulations performed by neglecting the effect of mismatch confirmed that local device variations play a crucial role in the design and sizing of the clock distribution network.
I. INTRODUCTION
T HE DESIGN of the clock distribution network is a critical task because the performance and functionality of synchronous systems directly depend upon the clock signal characteristics. At the same time, the logical and physical design of reliable synchronization circuitry involves the combined optimization of multiple conflicting objectives such as bounded skew, noise immunity, low power consumption, area, etc. Because of this, the automatic synthesis of the clock tree network has gained considerable attention in the design community; a comprehensive overview can be found in [1] . The problem of generating a zero-skew or a bounded-skew routing tree is of primary importance for high-speed logic design. Existing clock tree routing techniques [2] - [7] are based on linear or Elmore delay [8] metrics to approximate the signal delay in every tree branch. This approximation is provably a worst case delay estimation for RC trees [9] . Unfortunately, this upper bound is often considerably loose [8] , especially for long interconnect Manuscript received July 14, 2000 . This work was supported in part by CNR within the framework of Progetto Finalizzato MADESSII.
S. Zanella, A. Nardi, and A. Neviani are with the Dipartimento di Elettronica ed Informatica, Università degli Studi di Padova, Padova 35131 Italy (e-mail: idunk@dei.unipd.it; nardi@dei.unipd.it; neviani@dei.unipd.it).
M. Quarantelli is with the Dipartimento di Elettronica ed Automazione, Università di Brescia, Brescia 25123 Italy (e-mail: quarante@ing.unibs.it).
S. Saxena is with PDF Solutions Inc., Dallas, TX 75082 USA (e-mail: saxena@pdf.com).
C. Guardiani was with Central R&D, STMicroelectronics, Agrate Brianza (MI), Italy. He is now with PDF Solutions Inc., San Jose, CA 95110 USA (e-mail: carlo@pdf.com).
Publisher Item Identifier S 0894-6507(00)09488-4.
wires routed in deep submicrometer (DSM) multiple metal layer technologies. Thus, in order to maintain a reasonable level of accuracy, it is thus necessary to make extensive use of buffers throughout the clock tree.
On the other hand, it is possible to expect that the introduction of a large number of buffers will make the tree more sensitive to intradie process variations, particularly if the effect of buffer delay mismatch is not taken into account by the clock tree synthesis algorithm. One of the earliest attempts to account for the effect of process variations on the skew was proposed in [10] ; however, only global process corner variations have been considered in this work. In [2] , the effects of both interconnect and buffer intradie variations have been incorporated into an algorithm for reliable clock tree routing; however, the authors in [2] used a simplified buffer delay model ( -factors), and the effect of local variations was taken into account only by empirically varying the -factors of 10%. Xi and Dai [3] considered only global process variations (fast and slow corner combinations for PMOS and NMOS devices) and used global derating factors, thus ignoring the effect of local variations. Finally, authors in [6] included the effect of intrachip, long-range gradient variations in an analytical model of the skew of a pipelined H-tree. As the information regarding the individual position and orientation of a chip on the wafer is lost after wafer dicing, the model derived in [6] may be difficult to apply in practice. Moreover, it is shown in this paper that the usually neglected component of mismatch that is proportional to the inverse of the channel area tends to dominate in DSM technologies.
To our knowledge, this is the first work in which a thorough experimental characterization of statistical variability is applied to the accurate analysis of the effect of device mismatch on the clock skew of VLSI ICs. In fact, as far as we know, no systematic characterization of the mismatch has ever been run. We will compare the results of accurate statistical simulation including, respectively, the effect of global variations, area dependent mismatch, and area-and distance-dependent mismatch on the clock skew for two industrial strength circuits: 1) a graphic processor unit (GPU) for consumer applications; 2) a digital signal processing core (Viterbi decoder). All this is designed by using a standard design flow in 0.25-m CMOS technology.
By using these significant examples, we will show that both global and local, area-dependent process variations may have a considerable impact on the performances clock skew and that they have to be properly modeled and taken into account for reliable, high-speed clock tree synthesis. 
II. MISMATCH MODELING AND SIMULATION TECHNIQUE
Significant efforts has been spent in the past in the modeling and characterization of mismatch effects [11] - [17] , but the lack of an effective statistical simulation methodology, incorporating mismatch effects [18] , has prevented widespread application of mismatch analysis to large circuits. In fact, a straightforward application of any device mismatch model in Monte Carlo analyses usually leads to an unrealistic amount of circuit simulations.
In this section, we first introduce the mismatch model used throughout this paper and then present the adopted simulation technique.
A. Mismatch Modeling
From now on, we will address to the mismatch between two or more devices as the difference among their (ideally identical) electrical parameter values. Such a simple definition does not provide the engineer with a deep understanding of the real sources of the aforementioned variability. However, in order to serve our purposes, it is sufficient to recall that the causes of device mismatches, which are pseudorandom in nature, can be divided into two main categories: systematic and random. Obviously, the choice of the mismatch model is constrained to those that include both of these components.
Although several corrections and extensions to the DSM region, such as those presented in [12] , [15] , and more recently in [19] and [16] , have been proposed, the model proposed by Pelgrom et al. [11] is still the most widely used in the current practice.
Pelgrom's model describes the variance of the difference ( ) of the value of a generic parameter in two identically drawn rectangular devices, T1 and T2, with length and width , whose distance between their centers is ( Fig. 1) as (1) where and are the fitting parameters for the area-and the distance-dependent terms, respectively. These are directly extracted from process characterization data.
In general, in order to account for process variability, it is possible to describe the th electrical parameter (e.g., , , , etc.) of the th MOS device ( ) as the sum of three different components (2) where is the mean of and and are random variables with zero mean and variance and , respectively. In (2), accounts for the global variability (in other words that relative to wafer and lot level variance), while accounts for local (intradie) process variability (mismatch). Without losing generality, it is possible to assume that the parameters , , are mutually independent, so that the only nonnull correlation terms are those between the same parameter of different matched devices. In fact, as shown in [20] , it is always possible to transform the initial parameters set into a corresponding set of independent random variables by using proper statistical techniques (as for example, principal component analysis [21] ). The nonnull correlation terms are not directly available from the process characterization data, as usually the variance of the difference is characterized. However, by using the model proposed in [11] (3)
It is possible to demonstrate [22] that the covariance between the same parameter of a pair of identical devices satisfies (4) where coefficient of the area-dependent mismatch component for parameter ; coefficient of the spacing-dependent component; layout distance between th and th devices. The covariance terms in (4) are then stamped in the covariance matrix , which will be finally used to generate a set of properly correlated samples for the Monte Carlo simulation, as described in the next section.
B. Correlated Sampling
In order to investigate the effect of process variations on clock skew, a set of Monte Carlo SPICE simulations of the clock network was performed. It is therefore necessary to generate a large number of correlated samples for the relevant SPICE parameters of all the buffers in the tree, where the correlation stems from the intradie variability. The Cholesky decomposition has been used as in [20] to generate a set of correlated random samples for the Monte Carlo analysis with the desired covariance matrix. In vector notation, (2) can be written as (5) where random vector of correlated variables with mean and covariance matrix ; and independent random vectors with zero mean and unit variance; , lower triangular matrices obtained from the Cholesky decomposition of two block-diagonal covariance matrices ( ) associated with global and local variations, respectively.
From the independence of and , the following relation between , , and holds: (6) which expresses the intuitive fact that the total variance of each parameter is originated by the superposition of independent global and local physical effects. In practice, rather than using (5), it is more efficient to directly extract one single random vector with covariance matrix . Because of the finite population size, the sample mean may not actually be zero. In this case, it is possible to subtract the sample average from all the generated vectors in order to obtain unbiased estimations.
C. Unified Compact Representation of Interdie and Intradie Variability
The mismatch simulation methodology described in Section II-B becomes rapidly impractical when the product of the number of matched transistors by the number of independent process factors exceeds a few hundred. In fact, the number of Monte Carlo runs that would be required in order to obtain a sample with sufficient statistical significance is certainly very large (conceivably more than a few thousands simulations), while at the same time an approach based on response surface methodology (RSM) performance macro-modeling, such as those presented in [20] , [23] , [24] , cannot be applied because of the exponential complexity of the design of experiments (DOE) step. As the clock network circuitry may easily contain several hundreds of buffers, an alternative approach is required to study the effect of intradie variations on the clock network. In our study, we decided to use the methodology described in [25] and implemented in Circuit Surfer [18] . This technique exploits the correlation between device parameters within a given component type (e.g., NFET, PFET, etc.) and that between similar parameters in matched devices in order to decompose the total (interdie and intradie) variability into a few independent, dominant sources of variance. This is achieved by an iterative application of principal component analysis (PCA) [21] to a suitably reordered representation of the system correlation matrix (7) where the block composed by the first rows and columns represents the correlation matrix of the parameter for the devices; the sublock composed by the second rows and the first columns represents the correlation matrix between the parameters and for the devices; etc. In this way, if the parameters are independent, the correlation matrix will be block-diagonal. The correlation coefficients are directly obtained from process characterization data. In this way, it is possible to obtain a structured, unified representation of both interdie and intradie variability. This unified model can be simulated with a number of independent variables that, in the worst case, is linear with the number of matched transistor (but with a much smaller coefficient than other existing approaches) and that in most practical cases 1 only requires a small fixed number of additional independent variables. The number of independent variables is a function of the maximum error the user is willing to tolerate. In [25] , it is shown that by setting the screening (minimum precision) even to a very high value, the number of required simulations is still reasonably small. It must also be noted that the higher the correlation between the variables, the lesser the number of independent factors necessary to achieve a certain screening value.
III. MISMATCH SIMULATION OF INDUSTRIAL CLOCK TREES

A. Clock Skew and Related Timing Measurements Definition
The intradie variability analysis methodologies described in the previous sections have been used to analyze the impact of buffer mismatch on two different industrial clock tree networks test cases. The relevant timing measurements that have been measured in this work are: 1) datapath delay ( ); 2) slack between data and clock arrival time ( ); 3) clock skew ( ). In order to guarantee the correct functionality of a synchronous logic circuit, a well-defined set of timing constraints has to be satisfied, as described by the following set of inequalities: (8) and (9) where is the clock period and is the setup (hold) time of the sequential cells (FF/latches) used to synchronize the data. Note that both and can assume positive, zero, or negative values, which implies that the worst case value is implicitly assumed in (8) and (9) . The relations between the different timing measurements used in our examples are illustrated in Fig. 2 .
B. Monte Carlo Simulation of the Viterbi Decoder
The statistical simulation methodology described in Section II-B has been implemented by using SIMPILOT [24] and C language code and applied to study the skew of three different clock-tree networks (clock_bmu, clock_bmu_n, and clock_sys) of a Viterbi decoder designed in 0.25-m CMOS technology. Even though these networks are relatively small, they are representative of a reasonable block size for which it is possible to avoid local resynchronization of the clock signal. Moreover, it was necessary to identify a set of realistic clock networks that could be managed by the statistical simulator 1 The worst case corresponds to a process where all the fabricated devices behave as if they were completely independent (very poor matching), which is, of course, extremely unrealistic. The best case corresponds to perfect matching (no additional variables required). The most common case is that of slightly imperfect matching (the values in the correlation matrix are all close to one), corresponding to a small number of additional independent variables. in an acceptable time. The net clock_bmu is composed by a single level of 16 buffers driving a total of 910 flip-flops. The net clock_bmu_n is composed by one inverter followed by one buffer driving four identical branches, each composed by an inverter-buffer couple. This net drives 40 flip-flops. Finally, the clock_sys net is split into two branches. The first is made by an inverter-buffer couple driving four buffers, and the second by a couple of buffers with a fanout of 16. This net drives 640 flip-flops. In every buffer, all the devices have the same size and, ideally, they are all perfectly matched. The Viterbi decoder has been synthesized from an RT description on a 0.25-m CMOS standard cell library by using Design Compiler [26] ; the physical layout has been obtained by using Silicon Ensemble [27] ; and the clock tree routing has been automatically synthesized by using CT Gen [28] . In order to reduce the size of the problem, we decided to model the process variability by using and variations only, based on a buffer delay sensitivity analysis. In order to separate the effects of global variations, area-dependent mismatch and layout distance-dependent mismatch, the following set of statistical simulation has been performed: 1) nominal; 2) Monte Carlo analysis with global variations only; 3) Monte Carlo analysis with global variations and area-dependent mismatch ( ); 4) Monte Carlo analysis with global variations, area-and distance-dependent mismatch ( ). The fitting coefficients were directly available from the process characterization data. Given the extremely low value of the interdie mismatch coefficient, we decided to use a higher value for that. The reason for this is the assumption that the interdie mismatch plays an negligible role in this clock network. Thus, if for a large a negligible effect is observed, the actual effect will be negligible as well.
The Monte Carlo analyses have been performed by using 1000 random samples and extracting the clock skew value from transient SPICE-level simulations of the clock network.
C. Monte Carlo Simulation of the Graphic Processor Unit
The second test circuit analyzed in this work is a GPU, also fabricated in a 0.25-m CMOS technology and designed by using a traditional application-specific IC design flow. This is a very large circuit (few million gates, more than 2 cm die size), for which the application of standard mismatch simulation methodology, such as that applied in the Viterbi test case, resulted infeasible, even when considering only a small partition of the entire clock network. Therefore, we decided to apply the more advanced mismatch analysis technique described in Section II-C in this case. The process data indicated that mismatch was the major source of intradie variability for this kind of process. The timing properties of the clock network relative to the GPU interface unit, as described in Section III-A, have been measured. variability that will be observed is significant, as in general we would expect standard CMOS digital processes to have worse device matching performance.
IV. EXPERIMENTAL RESULTS
A. Viterbi Decoder
The results of the set of statistical simulations performed on the clock networks described in Section III-B are summarized in Table I . It is possible to observe that the effect of area-dependent mismatch on the skew is about one order of magnitude larger than that associated with global variations only. It is also interesting to note that the introduction of the component of local variance proportional to the layout distance between buffers does not seem to have a noticeable effect on the total skew variance. In fact, as reported in Table I , the difference between the simulations performed with and without its contribution can be considered negligible. It is important to note the following.
1) Since the characterized values of the coefficients were basically negligible for the technology considered in this paper, we decided to use a larger value obtained from the characterization of a previous 0.5-m CMOS technology. The results reported in this paper about the effect of distance-dependent mismatch are thus pessimistic and have to be considered as worst case results.
2) The actual block size (less than 2 mm) obviously is such that mutual buffer distance is always relatively small in our examples. This has to be considered when analyzing the conclusions about layout distance-dependent mismatch effect. The results presented in this paper are further confirmed by the histograms shown in Fig. 3 , where the effect of global variations [i.e., by setting in (5)], area-dependent [i.e., by setting in (3)], and using the full model of variance described in (5) on the skew of the largest clock tree network (clock_bmu) are reported. Finally it is important to note that, even though for practical reasons the contribution of other process variables (e.g., , , etc.) has been ignored, the results about the large impact of mismatch on clock reliability still hold. In fact, the introduction of extra sources of variance may only magnify the spread of the skew. This is additionally substantiated by the fact that the parameters used in this work are those that have the largest impact on the buffer delay, as confirmed by a sensitivity analysis.
B. Graphic Processor Unit
The results of the simulations performed are reported in Table II . It is possible to observe that the effect of the global variations are one order of magnitude greater than the effects of intradie variations. This is not actually unexpected, because the loads in the clock network are not well balanced. This leads to an increased effect of the global variability that would not normally be observed. It is important to note the following.
1) The uncertainty on the slack is much greater than the other two cases. In fact, the slack is influenced by both the clock skew and data timing uncertainty. Its value, anyhow, it is possibly due to the fact that the process is not yet stable.
2) The uncertainty on the clock skew is smaller than the one found before. This is possibly due to the fact that the correlation coefficient used are may be too high for a digital design case. A smaller coefficient should be used to make the simulation more realistic. In this case, the result should not be very different from the one found in the previous section.
V. CONCLUSION
In this paper, we have analyzed the impact of intradie process variability on the clock skew of VLSI blocks designed in DSM technologies. In order to incorporate the measured matching data into accurate Monte Carlo statistical simulations of the clock network, advanced techniques have been applied.
In the first example, a set of correlated random samples have been generated by using the Cholesky decomposition of the covariance matrices relative to inter-and intradie variations. This methodology has been applied to study the effect of process variations on the skew of three different portions of the clock tree of a VLSI macro-cell (Viterbi decoder) implemented by using a standard commercial design flow and an industrial DSM standard cell library. The experimental results have shown the following for the Viterbi decoder. 1) Global process variations have a negligible effect on the clock skew. This is possibly a consequence of the fact that the wire widths and the buffer loading have been properly sized and balanced. 2) Skew variance due to the local (intradie) mismatch effect is about one order of magnitude larger than that due to global variations 3) The effect on the skew variance of the component of local variability due to layout distance is negligible compared to that of the area-dependent component . This shows the importance of considering device mismatch for reliable, bounded-skew clock tree synthesis of circuits designed in DSM technologies.
In a second example, a statistical mismatch model order reduction technique has been applied to simulate the clock tree GPU. In fact, the size of the circuit and the number of transistors to be matched is such that it is impossible to apply any straightforward MC method. In this case, the experiments have shown that the effect of interdie variations are of an order of magnitude greater than the intradie variations.This is a consequence of the fact that the GPU clock tree is much more interconnect dominated than that of the Viterbi decoder. In fact, unbuffered wires in excess of 3 mm of length can be observed in this case. This, of course, makes this particular design more prone to both front-end of line and back-end of line (BEOL) interdie variations, with a predominance of BEOL variations under particular environment conditions. We may therefore conclude that both interdie and intradie variations may have a significant effect on the performance of DSM clock trees, depending on the design and routing strategies. Therefore, the design of a clock distribution network should take into account all the possible causes of performance degradation, including, in particular, inter-and intradie process variation. In particular, we have also shown that effects such as mismatch, which has been traditionally deemed to be a sheer analog effect, should also be of concern for digital designs.
ACKNOWLEDGMENT TheauthorsaregratefultoR.Burger,P.Delpech,andP.Llinares of ST Microelectronics for providing the design of the Viterbi encoder and the process characterization data; to T. Lin and N. Dragone of Carnegie-Mellon University for the help on physical clock tree routing and for many useful discussions, and to P. McNamara of PDFSolutionsInc. forhelp on the GPU simulations.
