Abstract-As CMOS technology scales, circuit performance becomes more sensitive to manufacturing and environmental variations. Hence, there is a need to measure or monitor circuit performance during manufacturing and at runtime. Since each circuit may have different sensitivities to process variations, previous works have focused on synthesis of circuit performance monitors that are specific to a given design. In this work, we study the potential benefit of having multiple design-dependent monitors. We develop a systematic approach to the synthesis of multiple design-dependent monitors, as well as a corresponding delay estimation method.
I. INTRODUCTION
Circuit performance variability continues to increase due to process variability, wide operating ranges, and other factors. Performance variability can often be compensated if accurate circuit performance estimation is available. For example, (1) circuit performance can be estimated early in the manufacturing flow for process tuning, or (2) circuits with adaptive mechanisms can optimize the tradeoff between energy and performance based on feedback from runtime circuit performance monitors. In this paper, we define circuit performance monitoring as a process that estimates the worst-case delay of a circuit based on measurements obtained from on-chip monitors.
Previous works on VLSI circuit performance monitoring can be classified according to the taxonomy shown in Figure 1. Generic monitors range from simple inverter-based ring oscillators (ROs) to more sophisticated process-specific ROs (PSROs) [2] and alternative monitoring structures such as phase-locked loops (PLLs) [11]. However, such generic monitors cannot capture design characteristics such as the mix of device types, which cause differing responses to process variations. As a result, delay estimation using generic monitors is less accurate and hence incurs larger margins.
Design of monitoring structures that are correlated to circuit performance (design-dependent monitors) has been addressed in several ways. Liu and Sapatnekar [13] propose a method to synthesize a single representative critical path (RCP) for post-silicon delay prediction. The RCP is designed such that it is highly correlated to all critical paths for some expected process variations. This approach uses only a single RCP to estimate the worst-case delay of multiple critical paths. Since the critical paths may have different sensitivities to process variations, using multiple RCPs can potentially improve delay estimation accuracy. The tunable replica circuit (TRC) method in [9] can synthesize different delay paths to more flexibly mimic circuit performance, but has higher design overhead compared to RO approaches. TRC also requires costly calibrations to obtain configurations that correspond to different operating conditions. By coupling process parameters extracted from parametric monitors with a design-specific delay model, more accurate and design-dependent delay estimation can be obtained from generic test structures [4] [15] [6]. Such an approach is flexible because an arbitrary delay model can be used and calibrated post-manufacturing. Meanwhile, parametric monitors can be designed such that they are highly sensitive to the targeted process variation. However, this approach requires many calibrations and resources for storage and computation of parameters.
Another class of design-dependent monitors [3] [10] [14] [16] [19] [20] estimates circuit performance by tracking delays of critical paths. Although these monitors show good estimation accuracies, having a monitor per path incurs high area overhead as well as longer design turnaround time.
In this paper, we propose a systematic methodology to synthesize multiple design-dependent ROs (DDROs) for circuit performance monitoring. A crucial and enabling observation is that critical path delay sensitivities form natural clusters (see Figure 4). Therefore, we can capture the design-specific delay sensitivities by synthesizing a monitor to match the delay sensitivities of each cluster. This approach has a lower implementation overhead compared to tracking each critical path because the number of clusters is much smaller than the number of critical paths.
Our DDRO approach offers several potential benefits compared to previous works. First, DDROs are more accurate compared to conventional ROs because they are synthesized to match the delay sensitivities of critical paths. Second, DDROs are more accurate compared to a single representative critical path because multiple DDROs are used to account for the differences between critical paths. Third, DDROs are less intrusive compared to in-situ monitoring methods.
Fourth, DDROs can be used during early manufacturing stages since the DDRO routing can be limited to local metal layers. Fifth, the total number of ROs (silicon area) is greatly reduced due to the clustering of critical paths. Only a few DDROs are required to provide accurate delay estimation. Finally, DDROs can be used for both early process tuning and real-time performance monitoring. Switching the monitoring purpose is simply a matter of redefining target variation sources (manufacturing or real-time variations) with minimal design modifications.
Our experimental results below show that the use of multiple DDROs can reduce delay overestimation by 15%-25% compared to the use of only one DDRO. Additional results show that the mean delay overestimation of our delay estimation method differs negligibly from that of a reference method, while the number of parameters used by our estimation method is significantly smaller. Our contributions are summarized as follows.
• We propose a systematic methodology to design multiple DDROs for chip frequency estimation. Experiments show that using 3 to 7 DDROs can achieve performance similar to that of "perfect" replica-based monitors.
• We propose a method to estimate chip delay and minimize guardband margin by using multiple DDRO measurements, within practical limits on information exchange between design and manufacturing.
In the following, Section II gives an overview of our methodology. We present two delay estimation methods in Section III. In Section IV, we discuss implementation details of DDRO synthesis. In Section V, we present experimental data to illustrate the use of DDROs to estimate circuit timing. Finally, we summarize our conclusions and future work in Section VI. All notations in this paper are defined in Table I.
Table I (excerpt). erf(·): error function of Gaussian distribution; P(·): cumulative probability function.
II. OVERVIEW OF DDRO APPROACH
An overview of our monitoring strategy is shown in Figure 2. First, we extract critical paths of a design and characterize their delay sensitivities to variation sources. Delay sensitivity of path $i$ ($V_{path_i}$) is obtained using finite differences (1), where $d^{nom}_{path_i}$ is the nominal delay of path $i$. Second, we cluster the critical paths based on their path sensitivities, and synthesize DDROs to match the delay sensitivity of the clusters. By matching DDRO and cluster delay sensitivities, we ensure that the synthesized DDROs have good correlation with the critical paths. Since we use only standard cells (gates) to synthesize the DDROs, the design and placement of DDROs can be easily integrated with conventional implementation flows. Based on DDRO frequencies, we can estimate chip delay during manufacturing or runtime. A circuit performance monitor typically feeds back the estimated delay with some margin to reduce the probability of reporting an underestimated delay value. However, the margin should be minimized to avoid significant performance loss due to a pessimistic delay estimation. Thus, we define the goal of circuit performance monitoring as:
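$$\min \; d_{overestimation} = \hat{d}_{max} - d_{max} \quad \text{subject to} \quad P(d_{max} > \hat{d}_{max}) \le 1 - Z, \qquad d_{max} = \max_{1 \le i \le N} d_{path_i} \qquad (2)$$
Here, $d_{overestimation}$ is the delay overestimation, $\hat{d}_{max}$ is the estimated maximum delay, $P(d_{max} > \hat{d}_{max})$ is the probability that $d_{max}$ is larger than $\hat{d}_{max}$, $N$ is the total number of critical paths of a chip, and $Z$ is a user-specified confidence.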
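The two quantities in this goal, the overestimation and the probability of underestimating the true maximum delay, can be evaluated empirically over a set of chips. The following is a minimal sketch under stated assumptions: the function names, the use of relative (percentage-style) overestimation, and the sample values are illustrative and are not taken from the paper.

```python
def monitoring_metrics(estimates, actuals):
    """Return (mean relative overestimation, fraction of underestimated chips).

    estimates[i] is the reported maximum delay of chip i and actuals[i] its
    true maximum delay. Relative overestimation is an assumed convention.
    """
    if not estimates or len(estimates) != len(actuals):
        raise ValueError("need two non-empty lists of equal length")
    n = len(estimates)
    mean_over = sum((e - a) / a for e, a in zip(estimates, actuals)) / n
    under_rate = sum(1 for e, a in zip(estimates, actuals) if a > e) / n
    return mean_over, under_rate


def meets_confidence(under_rate, z):
    """True if the underestimation rate respects a confidence level z."""
    return under_rate <= 1.0 - z
```

A monitor that satisfies the confidence check with a smaller mean overestimation wastes less margin, which is the comparison used throughout the experiments.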
III. PROPOSED METHOD

A. Delay and Variation Model
In this work, we use the variation model in [7], whereby lot-to-lot, wafer-to-wafer, and die-to-die process variations are lumped and modeled as global variation of a chip. The global variation also includes supply voltage and temperature fluctuations. Within-die gate delay mismatch is modeled as uncorrelated Gaussian random variables. Spatial correlation is ignored in the current work as it is small for most chips [7]. When the effect of spatial correlation is significant, DDROs can be distributed within a die as in [17] to improve correlations between DDROs and critical paths. The critical path delay is represented as a linear function of the variation sources (3), where $G$ is a $[Q \times 1]$ vector that represents the global variation of $Q$ variation sources, $l_{path_i}$ is the local delay variation of the $i$th path, $R$ is the correlation matrix for local delay variation, and $N(0, 1)$ are independent Gaussian random variables.
To verify the accuracy of our delay model, we first simulate a critical path using HSPICE [24] with a set of random global variations (100 trials; the variation sources are listed in Table II). Then, we compare the simulated path delays with the ones calculated from the linear delay model in (3). Figure 3 shows that path delays obtained from the linear model correlate very well with those from HSPICE simulation.
For DDROs, we use the same delay model as in (3). Since each RO has many identical gates, uncorrelated local variation is insignificant due to averaging of uncorrelated delay deviations. Therefore, we do not model local variation in the DDROs; in the resulting DDRO delay model (4), $d^{nom}_{ro}$ is the nominal delay of the DDRO (obtained from simulation) and $V_{ro_k}$ is a $[1 \times Q]$ vector that represents the delay sensitivity of the $k$th DDRO to global process variations.
B. A Reference Approach
A straightforward delay estimation method is to extract global variation using multiple process variation-specific monitors and calculate chip delay based on the linear delay model in (3). In other words, monitoring methods in [4], [15] and [6] can be combined and extended for delay estimation.
However, we use this approach only as a reference because it requires a large amount of memory to store parameters, as well as long computation time.
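Given $M$ DDROs, we can represent $V_{path_i}$ as a linear combination of the DDRO delay sensitivities $V_{ro_k}$ plus a residue $V^{res}_{path_i}$:
$$V_{path_i} = \sum_{k=1}^{M} b_{ik} V_{ro_k} + V^{res}_{path_i} \qquad (5)$$
Equation (6) shows that $d_{path_i}$ consists of a measurable term and an uncertainty term $u_i$. While the value of the measurable term can be determined from the delays of DDROs, the value of the uncertainty term cannot be measured directly. Since a larger $u_i$ leads to a larger $\hat{d}_{max} - d_{max}$ in (2), we should choose the values of $b_{ik}$ to minimize $u_i$.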
Assuming that $G$ is multivariate Gaussian, we can calculate the distribution of $d_{max}$ using the method in [18] based on (6) and (3); that is, we repeatedly approximate the maximum of two path delays as a Gaussian distribution by matching the first two moments. After that, we can approximate the maximum delay as in (7), where $\mu(d_{path_i})$ and $\sigma(d_{path_i})$ are the mean and standard deviation of the $i$th path delay. Given the resulting distribution of $d_{max}$, $\hat{d}_{max}$ can be readily obtained using the erf function for the Gaussian distribution, as in (8).
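The pairwise moment-matching step and the final quantile lookup can be illustrated with the classical closed-form moments of the maximum of two Gaussians (often attributed to Clark). This is a hedged sketch: whether [18] uses exactly these formulas is an assumption, the function names are hypothetical, and the chained maximum below treats paths as independent, which the delay model with shared global variation does not.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, used for its pdf, cdf and inverse cdf


def max_of_two_gaussians(mu1, sigma1, mu2, sigma2, rho=0.0):
    """Mean and std of a Gaussian matched to max(X1, X2) by two moments."""
    a2 = sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2
    if a2 <= 1e-18:
        # The two variables differ only by a constant; the larger mean wins.
        return (mu1, sigma1) if mu1 >= mu2 else (mu2, sigma2)
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    cdf, pdf = _STD.cdf(alpha), _STD.pdf(alpha)
    first = mu1 * cdf + mu2 * (1.0 - cdf) + a * pdf
    second = ((mu1 ** 2 + sigma1 ** 2) * cdf
              + (mu2 ** 2 + sigma2 ** 2) * (1.0 - cdf)
              + (mu1 + mu2) * a * pdf)
    return first, math.sqrt(max(second - first ** 2, 0.0))


def max_of_many(paths):
    """Fold the pairwise maximum over a list of (mean, std) path delays."""
    mu, sigma = paths[0]
    for m, s in paths[1:]:
        mu, sigma = max_of_two_gaussians(mu, sigma, m, s)
    return mu, sigma


def guardbanded_estimate(mu, sigma, z):
    """Delay value that a Gaussian(mu, sigma) stays below with probability z."""
    return mu + sigma * _STD.inv_cdf(z)
```

With a confidence of 99%, the guardband is roughly 2.33 standard deviations above the mean of the approximated maximum, so any reduction of the path delay standard deviation directly reduces the reported margin.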
C. Clustering
The next step is to minimize delay margin and find $V_{ro_k}$. Equations (7) and (8) show that the value of $\hat{d}_{max}$ is mainly determined by $V^{res}_{path}$, i.e., a larger $V^{res}_{path}$ will increase the magnitude of $\sigma(d_{path_i})$, which leads to a larger $\hat{d}_{max}$. Therefore, it is desirable to select $V_{ro}$ that minimizes $V^{res}_{path}$. We find $V_{ro_k}$ by clustering critical paths with similar $V_{path}$ into the same group, then assigning the centroid of the cluster as $V_{ro_k}$. To cluster the paths, we use the k-means++ algorithm in [1] and choose the best clustering solution among 100 random starts. The objective function of the clustering is defined in (9), where the summation is taken over paths $i$ in cluster $k$, and $w_i$ is the probability of a critical path delay exceeding the clock period of the design. The weight factor $w_i$ is added so that we can impose a higher penalty for having mismatched delay sensitivities on a path with higher probability to fail (less timing slack). Based on the delay model in (3) and the distribution of variation sources, we can calculate the delay distribution of path $i$ and extract $w_i$. Since the upper bound for V …
Fig. 4. Every dot in the figure represents a critical path's delay deviation for 20mV deviation in supply voltage (Vdd) and 15 degrees C deviation in temperature (temp). The critical paths are extracted from an ARM M3 processor (45nm technology) and simulated using HSPICE. We cluster the paths into 5 clusters and label them by different colors. The centroid of each cluster is marked by a black cross. (Axis label: delay sensitivity to Vdd, in %.)
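A weighted clustering of this kind can be sketched in a few lines. The code below is an illustration under assumptions, not the paper's implementation: it assumes a weighted squared-Euclidean cost, uses a k-means++ style seeding followed by Lloyd iterations, and all function names are hypothetical.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def weighted_kmeans(points, weights, k, seed=0, iters=100):
    """One run: k-means++ style seeding, then weighted Lloyd iterations."""
    rng = random.Random(seed)
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Sample the next seed in proportion to weight * squared distance.
        d2 = [w * min(sq_dist(p, c) for c in centroids)
              for p, w in zip(points, weights)]
        pick = rng.choice(points) if sum(d2) == 0 else rng.choices(points, weights=d2)[0]
        centroids.append(list(pick))
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [(p, w) for p, w, l in zip(points, weights, labels) if l == j]
            wsum = sum(w for _, w in members)
            if wsum == 0:
                updated.append(centroids[j])  # keep an empty cluster in place
            else:
                updated.append([sum(w * p[d] for p, w in members) / wsum
                                for d in range(len(points[0]))])
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    cost = sum(w * sq_dist(p, centroids[l])
               for p, w, l in zip(points, weights, labels))
    return centroids, labels, cost


def best_of_restarts(points, weights, k, restarts=100):
    """Keep the lowest-cost solution over several random starts."""
    runs = (weighted_kmeans(points, weights, k, seed=s) for s in range(restarts))
    return min(runs, key=lambda run: run[2])
```

Each returned centroid plays the role of a target sensitivity vector for one DDRO; a path with a larger weight pulls its cluster centroid closer to its own sensitivity.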
D. Proposed Delay Estimation Method
The reference method requires N × (M + H) parameters for runtime delay estimation. To reduce the number of parameters, we propose to design DDROs such that each of them is similar to the maximum delay distribution of each cluster. We calculate the maximum delay of paths in each cluster using the method in [18], assuming that the means of path delays correspond to their nominal values. The outcome of this step gives us the expected maximum delay. More importantly, it also extracts the sensitivity of the maximum delay to variation sources ($V_{max}$). Similar to the reference approach, we represent $V_{max}$ as a function of $V_{ro}$:
$$V_{max_x} = \sum_{k=1}^{M} a_{xk} V_{ro_k} + V^{res}_{clust_x} \qquad (10)$$
In the resulting cluster delay model, $d_{clust_x}$ denotes the delay of the $x$th cluster, $d^{nom}_{clust_x}$ represents the nominal delay of the $x$th cluster, and $r_x$ represents the random local delay of the $x$th cluster. After measuring the DDROs, we can calculate the maximum delay distribution of a chip, $d_{max}$, by taking the statistical maximum over the cluster delays.
Then, based on the distribution of $d_{max}$, we can find the value of $\hat{d}_{max}$ as in (8). Using this approximation method, the total number of parameters reduces from $N(1 + M + H)$ to $M(2 + M)$, where $M \ll N \ll H$. Moreover, the number of operations needed to calculate the maximum of two delay distributions during runtime is reduced from log(N) to log(M). For small $M$, this method can be implemented in hardware. This estimation method is therefore faster and more hardware resource-efficient than the reference method. As we will show later, the estimation error of this approximation approach relative to the reference method is very small.
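To make the storage saving concrete, the two parameter counts stated above can be compared directly. The snippet is a small illustration only: the function names are hypothetical, and the example values of N, M and H are made up for the purpose of the comparison.

```python
def reference_param_count(n_paths, m_ddros, h):
    """Parameter count of the reference method, N(1 + M + H)."""
    return n_paths * (1 + m_ddros + h)


def proposed_param_count(m_ddros):
    """Parameter count of the cluster-based method, M(2 + M)."""
    return m_ddros * (2 + m_ddros)


# Hypothetical sizes chosen only to respect M << N << H.
example = (reference_param_count(100, 5, 1000), proposed_param_count(5))
```

Because the proposed count depends only on the number of DDROs, it stays constant as the number of critical paths grows.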
IV. SYNTHESIS OF DDROS
A. ILP Formulation
Given a delay sensitivity target ($V_{ro}$), we want to choose the number of each gate module in a DDRO so that the delay sensitivities of the DDRO match the targeted delay sensitivities. In the ILP formulation (13), the nominal delay of each candidate gate type $h$ enters as a constant, $s_h$ is the integer variable that indicates the number of copies of gate type $h$ in the DDRO, $Y$ is the total number of gates allowed, $z_h$ is a binary variable that indicates whether gate $h$ is inverting, and $s_{inv}$ is a positive integer variable. In our experiments, solving the ILP with the public-domain solver [12] takes about one hour on a 3GHz single-core CPU.
Instead of minimizing the difference in relative delay sensitivity, the formulation in (13) minimizes the difference in absolute delay sensitivity so that the objective function is linear in $s_h$. This favors a solution with a smaller DDRO nominal delay, which may be suboptimal. To compensate for this inherent bias in the ILP, we add a constraint that defines the minimum DDRO delay. We then sweep the value of the minimum DDRO delay at 10 evenly spaced points along its feasible range.
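For intuition, a toy version of this selection problem can be solved by exhaustive enumeration. The sketch below rests on several assumptions that are not confirmed by the text: the mismatch is measured with an L1 norm on absolute sensitivities, the oscillation requirement is modeled as an odd number of inverting gates, and the gate data and function name are hypothetical. A real instance would be handed to an ILP solver instead.

```python
from itertools import product


def synthesize_ddro(gates, target, max_gates, min_delay=0.0):
    """Return (mismatch, counts, nominal delay) of the best gate mix, or None.

    gates: list of dicts with 'sens' (absolute sensitivity per variation
    source), 'delay' (nominal delay) and 'inverting' (bool).
    """
    best = None
    for counts in product(range(max_gates + 1), repeat=len(gates)):
        total = sum(counts)
        if total == 0 or total > max_gates:
            continue  # respect the limit on the total number of gates
        inverting = sum(c for c, g in zip(counts, gates) if g["inverting"])
        if inverting % 2 == 0:
            continue  # assumed: a ring needs an odd number of inversions
        delay = sum(c * g["delay"] for c, g in zip(counts, gates))
        if delay < min_delay:
            continue  # minimum-delay constraint that counters the bias
        mismatch = sum(
            abs(sum(c * g["sens"][q] for c, g in zip(counts, gates)) - target[q])
            for q in range(len(target))
        )
        if best is None or mismatch < best[0]:
            best = (mismatch, counts, delay)
    return best


toy_gates = [
    {"sens": (1.0, 0.0), "delay": 10, "inverting": True},
    {"sens": (0.0, 1.0), "delay": 12, "inverting": True},
    {"sens": (0.7, 0.2), "delay": 20, "inverting": False},
]
```

Sweeping `min_delay` over several values and keeping the solution with the best relative sensitivity match mirrors the sweep described above.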
B. Practical Considerations
Selecting major variation sources: To identify major variation sources that affect delay sensitivity, we simulate a seven-stage RO using HSPICE, and perturb each variation source one at a time. Based on the results in Figure 5, we can see that most of the variation sources have a noticeable effect on the delay except for $C_{gdl}$, $C_{gdo}$ and $C_{jswg}$. Therefore, we only consider 12 out of the 15 major variation sources, summarized in Table II. We do not include second-order sensitivities to the variation sources because their magnitudes are very small. This assumption is supported by the experimental data in Figure 3.
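The one-at-a-time perturbation used here is a first-order finite difference. A minimal sketch follows, with loud assumptions: the sensitivity is normalized by the nominal delay and by the step size, the delay function is a made-up linear toy rather than a circuit simulation, and all names are hypothetical. The step sizes reuse the 20mV and 15 degrees C deviations quoted for Figure 4.

```python
def finite_difference_sensitivities(delay_fn, nominal, steps):
    """Relative delay change per unit change of each variation source."""
    d_nom = delay_fn(nominal)
    sensitivities = {}
    for name, step in steps.items():
        perturbed = dict(nominal)
        perturbed[name] = nominal[name] + step  # perturb one source only
        sensitivities[name] = (delay_fn(perturbed) - d_nom) / (d_nom * step)
    return sensitivities


def toy_delay(params):
    """Hypothetical delay model in ps; stands in for a circuit simulation."""
    return 100.0 * (1.0 - 0.8 * (params["vdd"] - 1.0)
                    + 0.001 * (params["temp"] - 25.0))


toy_sens = finite_difference_sensitivities(
    toy_delay, {"vdd": 1.0, "temp": 25.0}, {"vdd": 0.02, "temp": 15.0})
```

A source whose sensitivity is negligible for every gate type can be dropped from the model, which is how the three capacitance parameters are excluded above.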
In our experimental setup, the impact of interconnect is modeled by parasitic resistance and capacitance extracted from the design layout. However, we do not model interconnect as a variation source because its impact is relatively small compared to that of active devices [5]. If interconnect variations are to be included, the DDRO must be built with gate modules (see Figure 6) that are sensitive to interconnect variations. While this can be achieved by connecting the standard cells in gate modules with interconnects at higher metal layers, in such a case DDROs cannot be measured at an early manufacturing stage, making short-loop process monitoring infeasible.
Characterizing gate sensitivity: Our ILP formulation in (13) assumes that the delay sensitivity of a gate is insensitive to the gates connected before and after it. This is a key assumption that simplifies the problem. If we model $V_{gate}$ as a function of its adjacent gate type, the total number of variables and the design space become intractable.
To decouple the load and slew interaction between the gates, we introduce gate modules as basic building blocks for DDROs. A gate module is defined as several identical gates connected in series, as illustrated in Figure 6. Simulation results in Figure 7 show that the sensitivity difference due to different input slew and output load is reduced from 0.15% to 0.03% as the number of stages in a gate module increases from 1 to 15. In this work, we use 5-stage gate modules as a tradeoff between stability of sensitivity and total area of a gate module.
[Fig. 6: gate module schematic; recoverable labels are Vdd, enable, node1 to node4, "Gate module", and "Connect to other modules".]
To further reduce the effect of output load, we carefully select the candidate gate types such that each of them has similar gate capacitance. Since the interconnect is also important for path sensitivity, we use two types of interconnect lengths in building our gate modules, i.e., the interconnect between consecutive gates in a module can be either short (5µm) or long (20µm). All interconnects in a gate module have the same length, and gate modules with different interconnect lengths are considered to be of different instance types even if they have the same gate type. Extra input pins of a multi-input gate are tied high or low to make a gate module inverting or buffering (see Figure 6).
Fig. 7. Simulation results show that the sensitivities under different input slew {5ps, 50ps} and output load {FO1, FO5} combinations converge as the number of stages in a gate module increases (annotated sensitivity differences: 0.15%, 0.08%, 0.03%).
Extracting $b_{ik}$ and $a_{xk}$: Section III represented $V_{path}$ and $V_{max}$ as linear combinations of $V_{ro_k}$, using $b_{ik}$ and $a_{xk}$, respectively. The challenge in this step is to find the values of $b_{ik}$ (resp. $a_{xk}$) such that the resulting residue, $V^{res}_{path}$ (resp. $V^{res}_{clust}$), is minimized. It is important to note that critical path and DDRO delays have nonlinear dependence on the parameters in Table II, and that they are subject to process and environment variations. Thus, solving (5) and (10) using simple least-squares fitting can lead to large $b_{ik}$ (resp. $a_{xk}$) values, which may magnify the noise from the DDROs. For example, Figure 8(a) shows that solving (5) using a linear least-squares method (without constraints on $b_{ik}$) leads to little delay overestimation when we consider global variation only. However, this is not true when we repeat the experiment with global and local variations, as well as other variations that are absent in our delay model.
To reduce the impact of large $b_{ik}$ (resp. $a_{xk}$) values, we solve the extraction problem using linear programming and apply constraints to bound $b_{ik}$ (resp. $a_{xk}$). The optimization formulation that we use is given in (14). Since the delay sensitivity mismatch between the critical paths and the DDROs is a matrix, we use the Frobenius norm in (14) to account for each entry in the matrix. We solve the problem using the optimizer in [25], setting the upper bound $ub$ to 1 and the lower bound $lb$ to −0.5. Results in Figure 8(b) show that with the constraints in (14), the delay estimation becomes less sensitive to circuit nonlinearity and other variations.
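The effect of bounding the coefficients can be illustrated on a single path. The following sketch is not the optimizer of [25]: it substitutes a plain projected-gradient iteration for a box-constrained least-squares fit of one path sensitivity vector, and the function name, step-size rule and test vectors are assumptions made for illustration.

```python
def fit_bounded_coefficients(v_path, v_ros, lb=-0.5, ub=1.0, iters=2000):
    """Minimize ||sum_k b[k] * v_ros[k] - v_path||^2 subject to lb <= b[k] <= ub."""
    m, q = len(v_ros), len(v_path)
    # 2 * (sum of squared entries) bounds the gradient's Lipschitz constant.
    lipschitz = 2.0 * sum(x * x for row in v_ros for x in row)
    if lipschitz == 0.0:
        return [0.0] * m
    step = 1.0 / lipschitz
    b = [0.0] * m
    for _ in range(iters):
        resid = [sum(b[k] * v_ros[k][j] for k in range(m)) - v_path[j]
                 for j in range(q)]
        grad = [2.0 * sum(resid[j] * v_ros[k][j] for j in range(q))
                for k in range(m)]
        # Gradient step followed by projection onto the box [lb, ub].
        b = [min(ub, max(lb, b[k] - step * grad[k])) for k in range(m)]
    return b
```

In this toy setting, a path whose sensitivity is twice that of a DDRO gets a coefficient clipped at the upper bound rather than 2, which leaves a larger residue but limits how much DDRO measurement noise is amplified.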
V. EXPERIMENTAL RESULTS
To validate our performance monitoring methodology, we synthesize, place and route three benchmark circuits using a …
Fig. 8. … (14) shows that linear model and HSPICE results are similar. This suggests that $b_{ik}$ must be extracted with constraints so that the mean delay overestimation is less sensitive to variations and nonlinearity.
Following the flow in Figure 2, we extract the delay sensitivity of each critical path to each of the variation sources in Table II using HSPICE. Note that HSPICE-based sensitivity characterization is not mandatory in our design flow, and that it can be replaced by other methods (e.g., the statistical method in [20]).
To evaluate the quality of our DDRO synthesis and delay estimation methodologies, we run Monte Carlo experiments on the critical paths and DDROs. Since each critical path is defined for a specific input and simulated independently, we cannot capture the correlation of local variation due to gate sharing. As an alternative, we run another set of Monte Carlo experiments using the linear model in (3) . In both simulations, we use the path and DDRO delay sensitivities extracted from HSPICE simulation results to minimize the discrepancy between them. In the linear model experiment, we sample the values of variation sources by using the Gaussian random number generator in Matlab [23] . In HSPICE simulation, we use the built-in Monte Carlo setup in the 45nm commercial device model. The number of trials in the Monte Carlo experiment is 1000 and 100 for the linear model and for HSPICE simulation, respectively. In all experiments, we set the user-specified confidence Z = 99%.
A. Simulation Results
Experiments using linear model:
The simulation results in Figures 9 and 10 show that our approximate delay estimation method achieves results similar to those of the reference method. The results also show that the mean delay overestimations of all benchmark circuits decrease noticeably as the number of clusters increases from 1 to 12. This confirms our hypothesis that having multiple DDROs that correlate well with the critical paths can reduce chip delay overestimation. The results also show that delay overestimation is nonzero even when the number of DDROs is 12. This is because $V^{res}_{path}$ and $V^{res}_{clust}$ are nonzero when we apply constraints in the $b_{ik}$ and $a_{xk}$ extractions. We further observe that the benefit of using multiple DDROs is more significant when the local variation is small relative to the global variation. This is because replica-like monitors (e.g., PSRO, DDRO, PLL) can only replicate the impact of global variation on critical paths. If local variation dominates, more intrusive monitoring is required to measure the impact of local variation.
Based on the simulation results with global and local variations, minimum delay overestimations for the AES, M0 and MIPS testcases are 2.4%, 2.6% and 3.3%, respectively. Note that the values of minimum delay overestimation correlate with the clock period of the benchmark circuits (see Table III ), which is related to the magnitude of local variations. This suggests that the achievable minimum delay overestimation is limited by the local variation of a design. Therefore our performance monitoring method may be more suited for low-speed designs with longer critical paths that are less susceptible to local delay variations.
HSPICE Simulations:
The HSPICE results in Figure 11 are mostly similar to the linear model results. Several sources of inaccuracy contribute to the discrepancies between HSPICE and linear model results. First, our delay estimation does not account for nonlinearity in circuit delay. Although we have shown that the impact of nonlinearity is small (Figure 3), small errors from nonlinearity could be magnified by $b_{ik}$ or $a_{xk}$. In other words, due to circuit nonlinearity, the delay estimation is sensitive to the extraction of $b_{ik}$ and $a_{xk}$. For example, the MIPS benchmark circuit has a higher overestimation when the number of DDROs is 12. This is an artifact of the constraints in (14), whereby a tighter constraint can reduce delay estimation quality (see Figure 12). We leave understanding of the tradeoff between estimation quality and robustness for future work.
Despite a user-specified confidence of 99%, the results show 2.5% and 5.3% of instances (chips) being underestimated in the linear model and HSPICE experiments, respectively. Since the results of the linear model experiment are free from nonlinearity error, the underestimation error is mainly due to the approximation in the statistical maximum function given by [18]. The HSPICE results have more underestimated instances because local variation is not modeled consistently, i.e., HSPICE simulates critical paths with uncorrelated local random variation but our delay estimation accounts for correlation between local variations. As a result, our delay estimates are slightly smaller than the path delays obtained from HSPICE simulation.
VI. CONCLUSION
In this paper, we have proposed methods to systematically design multiple DDROs, and to estimate circuit performance (chip delay) based on the measurements from the multiple DDROs. Our study shows that by using multiple DDROs we can reduce the mean delay overestimation of a design by up to 25% (from 4% to 3%).
We also show that our delay estimation method can achieve results similar to those of the reference method with significantly fewer parameters. Therefore, our method is more amenable to hardware implementation.
We also observe that the benefit of using replica-like monitors (such as DDROs) is more significant when the local variation is small relative to the global variation. If local variation dominates, then in-situ monitoring, though expensive, will fare better. With shrinking feature dimensions, increasing wafer sizes and changing device structures (e.g., fully depleted SOI, FinFETs), it is difficult to project which of the two components of variation will dominate in future technologies.
To verify the performance of DDRO and the proposed delay estimation approach, we have taped out a testchip using 45nm technology together with an ARM CORTEX M3 CPU. Ongoing work also addresses (1) the tradeoff between estimation quality and robustness during b ik and a xk extraction; and (2) silicon measurements from our testchip.
