Process variations in nanometer technologies are becoming an important issue for cutting-edge FPGAs with multimillion-gate capacity. Considering both die-to-die and within-die variations in effective channel length, threshold voltage, and gate oxide thickness, we first develop closed-form models of leakage and timing variations at the FPGA chip level. Experiments show that our models are within 3% of Monte Carlo simulation, and that the leakage and delay variations can be up to 3X and 1.9X, respectively. We then derive analytical yield models considering both leakage and timing variations, and use these models to evaluate FPGA devices and architectures under process variations. Compared to an architecture similar to a commercial FPGA with the device setting from the ITRS roadmap, device tuning alone improves leakage yield by 39%, and architecture and device co-optimization increases leakage yield by 73%. We also show that LUT size 4 gives the highest leakage yield and LUT size 7 gives the highest timing yield, but LUT size 5 achieves the maximum combined leakage and timing yield. To the best of our knowledge, this is the first in-depth study on FPGA device and architecture co-evaluation considering process variations.
INTRODUCTION
Modern VLSI designs see a large impact from process variation as devices scale down to nanometer technologies. Variability in effective channel length, threshold voltage, and gate oxide thickness introduces uncertainty in both chip performance and power consumption. For example, the measured variation in chip-level leakage can be as high as 20X compared to the nominal value for high-performance microprocessors [1]. In addition to meeting the performance constraint under timing variation, dies with excessively large leakage due to such high variation have to be rejected to meet the given power budget. There have been a few studies on parametric yield estimation considering both timing [2, 3] and leakage [4, 5] variations in ASICs. However, parametric yield for FPGAs is largely unexplored in the literature.
Existing FPGA architecture evaluation has considered performance, area, and power [6, 7, 8, 9]. The work in [10] evaluated new FPGA architectures with field-programmable supply voltage, including dual-Vdd and power-gating. A very recent work [11] showed that device and architecture co-optimization obtains the largest improvement in FPGA performance and power efficiency. However, none of the evaluation work so far has considered process variations.
* This paper is partially supported by NSF CAREER award CCR-0093273, and NSF grant CCR-0306682. We used computers donated by Intel. Address comments to lhe@ee.ucla.edu.
In this paper, we first develop closed-form models of leakage and timing variations at the FPGA chip level, considering both die-to-die and within-die variations. Experiments show that our models are within 3% of Monte Carlo simulation, and that the leakage and delay variations can be up to 3X and 1.9X, respectively. We also show that leakage is more sensitive to within-die variation than to inter-die variation, whereas timing is more sensitive to inter-die variation than to within-die variation. We then evaluate FPGA devices and architectures under process variations. Compared to an architecture similar to a commercial FPGA with the device setting from the ITRS roadmap, device tuning alone improves leakage yield by 39%, and architecture and device co-optimization increases leakage yield by 73%. We also show that LUT size 4 gives the highest leakage yield and LUT size 7 gives the highest timing yield, but LUT size 5 achieves the maximum combined leakage and timing yield.
The rest of the paper is organized as follows. Section 2 presents background knowledge. Section 3 derives closed-form models for leakage and delay variations. Section 4 develops the leakage and timing yield models. Section 5 analyzes the leakage and timing yield rates, and Section 6 concludes the paper.
BACKGROUND
We assume the same cluster-based island-style FPGA as the existing architecture evaluation work [8]-[11]. A logic block is a cluster of fully connected Basic Logic Elements, each of which consists of one LUT and one flip-flop. The cluster size $N$ and LUT size $K$ are the architectural parameters to be evaluated. For simplicity, we assume the same fixed routing architecture as [11], i.e., fully buffered routing switches and uniform wire segments spanning four logic blocks. We also optimize devices in terms of $V_{dd}$ and $V_{th}$.
Such architecture and device co-optimization can easily involve hundreds of device and architecture combinations. A runtime-efficient trace-based estimation tool, Ptrace, was proposed to handle such co-optimization [11]. For a given benchmark set and a given FPGA architecture, statistical information on switching activity, critical path structure, and circuit element utilization is collected by profiling the placed and routed benchmark circuits. This statistical information is called the trace of the given benchmark set. Closed-form formulas are then used to compute power and delay from the trace information and device parameters. Ptrace has high accuracy compared to detailed verification, in which a circuit is placed and routed by VPR [6] and simulated by the cycle-accurate power simulator Psim [10].
In this paper, we consider the variation in threshold voltage ($V_{th}$), effective channel length ($L_{eff}$), and gate oxide thickness ($T_{ox}$). Similar to [4], where ASICs are assumed, each variation ($\Delta P$) is decomposed into a global (die-to-die) variation ($\Delta P_g$) and a local (within-die) variation ($\Delta P_l$). We extend Ptrace to consider these variations and then perform device and architecture co-optimization under process variations.
LEAKAGE AND TIMING VARIATIONS

Leakage under Variation
We extend the leakage model in the FPGA power and delay estimation framework Ptrace [11] to consider variations. In Ptrace, the total leakage of an FPGA chip is calculated as follows,

$$I_{chip} = \sum_i N^t_i \cdot I_i \qquad (1)$$

where $N^t_i$ is the number of FPGA circuit elements of resource type $i$, i.e., an interconnect switch, buffer, LUT, configuration SRAM cell, or flip-flop, and $I_i$ is the leakage of one element of that type. Different sizes of interconnect switches and buffers are considered as different circuit elements.
The leakage current $I_i$ of a circuit element $i$ is the sum of the subthreshold and gate leakages:

$$I_i = I_{sub} + I_{gate} \qquad (2)$$

Variation in $I_{sub}$ mainly comes from variation in $L_{eff}$ and $V_{th}$. Variation in $I_{gate}$ mainly comes from variation in $T_{ox}$. Unlike [4], which models subthreshold leakage and gate leakage separately, we model the total leakage current $I_i$ of a circuit element in resource type $i$ as follows,
where $I_n(i)$ is the leakage of a circuit element in resource type $i$ in the absence of any variability and $f$ is the function that represents the impact of each type of process variation on leakage.
The interdependency between these functions has been shown to be negligible in [4]. From SPICE simulation, we find that it is sufficient to express these functions as simple linear functions. To simplify the presentation, we denote $\Delta L_{eff}$, $\Delta V_{th}$, and $\Delta T_{ox}$ as $L$, $V$, and $T$, respectively. With this notation, we can express these functions as follows,
where $c_{i1}$, $c_{i2}$, and $c_{i3}$ are fitting parameters determined by SPICE simulations. The negative sign in the exponent indicates that transistors with shorter channel length, lower threshold voltage, and smaller oxide thickness have higher leakage current. We rewrite (3) as follows by decomposing $L$, $V$, and $T$ into local ($L_l$, $V_l$, $T_l$) and global ($L_g$, $V_g$, $T_g$) components.
To extend the leakage model (1) under variations, we consider that each element has unique local variations but all elements in one die share the same global variations. Both global and local variations are modeled as normal random variables. The leakage distribution of a circuit element is therefore lognormal, and the total leakage is a sum of lognormals. A state-of-the-art FPGA chip usually has a large number of circuit elements, and therefore the relative random variance of the total leakage approaches zero. As in [4], we apply the Central Limit Theorem and use the mean of the distribution to approximate the distribution of the sum of lognormals. After integration, we can write the expression for the chip-level leakage as follows,
where $S_i$ is the scale factor introduced by local variability in $L$, $V$, and $T$. $I_{L_g,V_g,T_g}(i)$ is the leakage as a function of the global variations. $\sigma_{L_l}$, $\sigma_{V_l}$, and $\sigma_{T_l}$ are the standard deviations of $L_l$, $V_l$, and $T_l$, respectively. For an FPGA architecture with power-gating capability, an unused circuit element can be power-gated to reduce leakage power. In this case, Ptrace calculates the total leakage current as follows,

$$I_{chip} = \sum_i \left[ N^u_i \cdot I_i + \alpha_{gating} \cdot (N^t_i - N^u_i) \cdot I_i \right] \qquad (7)$$

where $N^u_i$ is the number of used circuit elements in FPGA resource type $i$ and $\alpha_{gating}$ is the average leakage ratio between a power-gated circuit element and a circuit element in normal operation. As in [11], 1/300 is used for $\alpha_{gating}$ in this paper. Similar to (6), (7) can be extended to consider variations as follows,

$$I_{chip} = \sum_i \left[ N^u_i \cdot E[I_i] + \alpha_{gating} \cdot (N^t_i - N^u_i) \cdot E[I_i] \right] \qquad (8)$$

where $E[I_i]$ is still defined as in (6).
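As an illustration, the chip-level leakage computation described in this section can be sketched in a few lines of Python. This is a minimal sketch and not the Ptrace implementation. It assumes that the per-element leakage has the form $I_n(i) \cdot e^{-(c_{i1} L + c_{i2} V + c_{i3} T)}$ with independent zero-mean normal variations, in which case the local-variation scale factor is the standard lognormal mean $e^{(c_{i1}^2 \sigma_{L_l}^2 + c_{i2}^2 \sigma_{V_l}^2 + c_{i3}^2 \sigma_{T_l}^2)/2}$. The `ResourceType` record, its field names, and any numeric values used with it are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ResourceType:
    n_total: int      # N^t_i: number of elements of this resource type
    n_used: int       # N^u_i: number of used elements
    i_nominal: float  # I_n(i): leakage without any variation
    c: tuple          # (c_i1, c_i2, c_i3); ASSUMED exponent form exp(-c * x)


def scale_factor(c, sigma_local):
    """Mean of exp(-(c1*L_l + c2*V_l + c3*T_l)) for independent zero-mean normals."""
    return math.exp(0.5 * sum((ck * sk) ** 2 for ck, sk in zip(c, sigma_local)))


def expected_element_leakage(res, global_var, sigma_local):
    """E[I_i] for fixed global variations (L_g, V_g, T_g)."""
    exponent = -sum(ck * gk for ck, gk in zip(res.c, global_var))
    return scale_factor(res.c, sigma_local) * res.i_nominal * math.exp(exponent)


def chip_leakage(resources, global_var, sigma_local, alpha_gating=None):
    """Sum expected leakage over resource types.

    With alpha_gating set, unused elements leak only alpha_gating times
    the leakage of an element in normal operation.
    """
    total = 0.0
    for res in resources:
        e_i = expected_element_leakage(res, global_var, sigma_local)
        if alpha_gating is None:
            total += res.n_total * e_i
        else:
            unused = res.n_total - res.n_used
            total += res.n_used * e_i + alpha_gating * unused * e_i
    return total
```

Calling `chip_leakage` with `alpha_gating=1/300` corresponds to the power-gated case, while leaving it as `None` corresponds to the case without power-gating.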
Timing under Variation
Performance depends on $L_{eff}$, $V_{th}$, and $T_{ox}$, but its variation is primarily affected by $L_{eff}$ variation [4]. Below we extend the delay model in Ptrace to consider global and local variations of $L_{eff}$. The structure of the critical path of each benchmark is obtained for timing analysis. The path delay is calculated as follows,

$$D = \sum_i d_i(L_g, L_l) \qquad (9)$$
For circuit element $i$ in the path, $d_i(L_g, L_l)$ is the delay considering global variation $L_g$ and local variation $L_l$. $L_g$ is the same for all the circuit elements in the critical path. Given $L_g$, we evenly sample a few (eleven in this paper) points within the range $[L_g - 3\sigma_{L_l}, L_g + 3\sigma_{L_l}]$. We then perform SPICE simulation to obtain the delay of each circuit element at these sample points. As the delay varies monotonically with $L_{eff}$ (a shorter channel gives a smaller delay), we can directly map the probability of a channel length to the probability of a delay and obtain the delay distribution of a circuit element. We assume that the local channel length variations of different elements are independent of each other. Therefore, we can obtain the distribution of the critical path delay for a given $L_g$ by convolution as follows,

$$PDF(D \mid L_g) = \bigotimes_i PDF(d_i \mid L_g) \qquad (10)$$
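The sampling and convolution procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the SPICE-characterized delay of each element is replaced by a hypothetical callable `delay_of_length`, delays are quantized to a hypothetical `resolution` so that distributions can be stored as dictionaries from integer delay bins to probabilities, and the sample weights are taken from a normal density normalized over the eleven sample points.

```python
import math
from collections import defaultdict


def local_length_pmf(l_g, sigma_l, points=11):
    """Evenly spaced channel lengths in [L_g - 3s, L_g + 3s] with normalized normal weights."""
    step = 6.0 * sigma_l / (points - 1)
    samples = [l_g - 3.0 * sigma_l + k * step for k in range(points)]
    weights = [math.exp(-0.5 * ((l - l_g) / sigma_l) ** 2) for l in samples]
    total = sum(weights)
    return [(l, w / total) for l, w in zip(samples, weights)]


def element_delay_pmf(delay_of_length, l_g, sigma_l, resolution):
    """Map each sampled channel length to a quantized delay, carrying its probability."""
    pmf = defaultdict(float)
    for length, prob in local_length_pmf(l_g, sigma_l):
        pmf[round(delay_of_length(length) / resolution)] += prob
    return dict(pmf)


def convolve(pmf_a, pmf_b):
    """Distribution of the sum of two independent quantized delays."""
    out = defaultdict(float)
    for d_a, p_a in pmf_a.items():
        for d_b, p_b in pmf_b.items():
            out[d_a + d_b] += p_a * p_b
    return dict(out)


def path_delay_pmf(element_delay_fns, l_g, sigma_l, resolution=1e-12):
    """Convolve the per-element delay distributions along one path for a fixed L_g."""
    path = {0: 1.0}
    for fn in element_delay_fns:
        path = convolve(path, element_delay_pmf(fn, l_g, sigma_l, resolution))
    return path
```

The keys of the returned dictionary are delays in units of `resolution`. Quantizing the delays keeps the support of the convolved distribution small, at the cost of a rounding error of at most half a bin per element.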
YIELD MODELS
Leakage Yield
The leakage yield is calculated on a bin-by-bin basis, where each bin corresponds to a specific value of $L_g$. Within a particular bin, $L_g$ is constant. We can rewrite (6) for the chip-level leakage current as follows,
where $A_i$ is the leakage current of all circuit elements of resource type $i$ at a given value of $L_g$, including the scale factor $S_i$ due to local variability. Let $X_i$ be the leakage consumed by the elements of resource type $i$; it is a lognormal variable. The chip-level leakage current $I_{chip}$ is the sum of the lognormal variables $X_i$ [4] and can be expressed as follows,

$$I_{chip} = \sum_i X_i \qquad (12)$$
As in [4], we model $I_{chip}$, the sum of the lognormal variables $X_i$, as another lognormal random variable. The lognormal variables $X_i$ share the same random variables $V_g$ and $T_g$, and are therefore dependent on each other. Considering this dependency, we calculate the mean and variance of the new lognormal $I_{chip}$ as follows,

$$\mu_{I_{chip}} = \sum_i E[X_i], \qquad \sigma^2_{I_{chip}} = \sum_i Var[X_i] + 2 \sum_{i<j} Cov(X_i, X_j) \qquad (13)$$

where the mean of $I_{chip}$, $\mu_{I_{chip}}$, is the sum of the means of $X_i$, and the variance of $I_{chip}$, $\sigma^2_{I_{chip}}$, is the sum of the variances of $X_i$ and the covariances of each pair of $X_i$. The covariance is calculated as follows,
We then use the method from [4] to obtain the mean and variance ($\mu_{N,I_{chip}}$, $\sigma^2_{N,I_{chip}}$) of the normal random variable corresponding to the lognormal $I_{chip}$. As the exponential function that relates the lognormal variable $I_{chip}$ to the normal variable $I_{N,chip}$ is monotonically increasing, the CDF of $I_{chip}$ can be expressed as follows using the standard expression for the CDF of a lognormal random variable,

$$CDF(I_{chip}) = \frac{1}{2}\left[1 + \mathrm{erf}\left(\frac{\ln(I_{chip}) - \mu_{N,I_{chip}}}{\sqrt{2}\,\sigma_{N,I_{chip}}}\right)\right] \qquad (15)$$

where erf() is the error function. Given a leakage limit $I_{cut}$ for $I_{chip}$, $CDF(I_{cut}) \times 100\%$ gives the leakage yield rate $Y_{leak}(I_{cut}|L_g)$, i.e., the percentage of FPGA chips whose leakage is smaller than $I_{cut}$ in a particular $L_g$ bin. Similarly, the yield for an FPGA chip with power-gating capability can be calculated using (8).
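The leakage yield computation for one $L_g$ bin can be illustrated with the following sketch. It is not the authors' code. It assumes that each $X_i$ has the form $A_i \cdot e^{-(c_{i2} V_g + c_{i3} T_g)}$ with independent zero-mean normal $V_g$ and $T_g$, and it uses standard moment matching to obtain the parameters of the corresponding normal variable, which may differ in detail from the method of [4]. The function names and the `terms` tuple layout are hypothetical, and the sketch requires nonzero global variation so that the matched standard deviation is positive.

```python
import math


def lognormal_sum_moments(terms, sigma_vg, sigma_tg):
    """Mean and variance of sum_i X_i, X_i = A_i * exp(-(c_i2*V_g + c_i3*T_g)).

    terms is a list of (A_i, c_i2, c_i3). All X_i share V_g and T_g, so the
    variance includes the covariance of every pair.
    """
    def expect(a, c_v, c_t):
        return a * math.exp(0.5 * ((c_v * sigma_vg) ** 2 + (c_t * sigma_tg) ** 2))

    mean = sum(expect(a, c_v, c_t) for a, c_v, c_t in terms)
    var = 0.0
    for a_i, cv_i, ct_i in terms:
        for a_j, cv_j, ct_j in terms:
            e_product = expect(a_i * a_j, cv_i + cv_j, ct_i + ct_j)
            # Cov(X_i, X_j); the i == j terms are the variances.
            var += e_product - expect(a_i, cv_i, ct_i) * expect(a_j, cv_j, ct_j)
    return mean, var


def matched_normal_params(mean, var):
    """Normal (mu, sigma) whose exponential has the given mean and variance."""
    sigma_n_sq = math.log(1.0 + var / mean ** 2)
    return math.log(mean) - 0.5 * sigma_n_sq, math.sqrt(sigma_n_sq)


def leakage_yield(i_cut, terms, sigma_vg, sigma_tg):
    """Fraction of dies in this L_g bin whose chip leakage is below i_cut."""
    mean, var = lognormal_sum_moments(terms, sigma_vg, sigma_tg)
    mu_n, sigma_n = matched_normal_params(mean, var)
    z = (math.log(i_cut) - mu_n) / (math.sqrt(2.0) * sigma_n)
    return 0.5 * (1.0 + math.erf(z))
```

For a single resource type the sum is exactly lognormal, so a cutoff equal to $A_i$ (the median) gives a yield of one half, which is a convenient sanity check for the moment matching.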
Timing Yield
The timing yield is again calculated on a bin-by-bin basis, where each bin corresponds to a specific value of $L_g$. We further consider the local variation of channel length in the timing yield analysis. Given the global channel length variation $L_g$, (10) gives the PDF of the critical path delay $D$ of the circuit. We can obtain the CDF of the delay, $CDF(D|L_g)$, by integrating this PDF for a given $L_g$. Given a cutoff delay $D_{cut}$ and $L_g$, $CDF(D_{cut}|L_g)$ gives the probability that the path delay is smaller than $D_{cut}$ considering $L_{eff}$ variations. However, it is not sufficient to analyze only the path that is critical in the absence of process variations. Near-critical paths may become critical under variations, and an FPGA chip that meets the performance requirement must have the delay of all paths no greater than $D_{cut}$.
We assume that the delays of the paths are independent and calculate the timing yield for a given $L_g$ as follows,

$$Y_{perf}(D_{cut}|L_g) = \prod_{i=1}^{n} CDF_i(D_{cut}|L_g) \qquad (16)$$

where $CDF_i(D_{cut}|L_g)$ gives the probability that the delay of the $i$-th longest path is no greater than $D_{cut}$. In this paper, we only consider the ten longest paths, i.e., $n = 10$, because the simulation results show that the ten longest paths already cover all the paths with a delay larger than 75% of the critical path delay under the nominal condition. We then integrate $Y_{perf}(D_{cut}|L_g)$ over $L_g$ to calculate the performance yield $Y_{perf}$ as follows,

$$Y_{perf} = \int_{-\infty}^{+\infty} PDF(L_g) \cdot Y_{perf}(D_{cut}|L_g)\, dL_g \qquad (17)$$
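A minimal sketch of the timing yield computation is given below. It assumes that each path delay distribution for a given $L_g$ is available as a dictionary from delay to probability, and it approximates the integration over $L_g$ by a weighted sum over evenly spaced bins within ±3σ. The callback `paths_for_lg`, the number of bins, and the function names are hypothetical.

```python
import math


def cdf_at(pmf, d_cut):
    """P(delay <= d_cut) for a discrete delay distribution {delay: probability}."""
    return sum(p for d, p in pmf.items() if d <= d_cut)


def timing_yield_given_lg(path_pmfs, d_cut):
    """Product of the per-path CDFs, assuming the path delays are independent."""
    result = 1.0
    for pmf in path_pmfs:
        result *= cdf_at(pmf, d_cut)
    return result


def lg_bins(sigma_g, bins=13):
    """Bin centers over [-3s, +3s] with normal weights normalized to sum to 1."""
    step = 6.0 * sigma_g / (bins - 1)
    centers = [-3.0 * sigma_g + k * step for k in range(bins)]
    weights = [math.exp(-0.5 * (c / sigma_g) ** 2) for c in centers]
    total = sum(weights)
    return [(c, w / total) for c, w in zip(centers, weights)]


def timing_yield(paths_for_lg, d_cut, sigma_g, bins=13):
    """Weighted sum over L_g bins of the per-bin timing yield.

    paths_for_lg(l_g) must return the delay distributions of the n longest
    paths for that global channel length variation.
    """
    return sum(w * timing_yield_given_lg(paths_for_lg(l_g), d_cut)
               for l_g, w in lg_bins(sigma_g, bins))
```

Because the per-bin yield is a product over paths, a single near-critical path with a low CDF value at $D_{cut}$ is enough to pull the whole bin down, which is why more than the nominal critical path has to be included.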
Leakage and Timing Combined Yield
To analyze the yield of a lot, we need to consider both the leakage and delay limits. Given a specific global variation of channel length $L_g$, the leakage variability only depends on the variability of the random variables $V_g$ and $T_g$ as shown in (6), and the timing variability only depends on the variability of the random variable $L_l$. Therefore, we assume that the leakage yield and timing yield are independent of each other for a given $L_g$. The yield considering the imposed leakage and timing limits can be calculated as follows,

$$Y = \int_{-\infty}^{+\infty} PDF(L_g) \cdot Y_{leak}(I_{cut}|L_g) \cdot Y_{perf}(D_{cut}|L_g)\, dL_g \qquad (18)$$
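The following sketch illustrates why the combined yield is accumulated bin by bin rather than taken as the product of the overall leakage and timing yields. The three-bin distribution of $L_g$ and the per-bin yield values are made-up numbers chosen only to mimic the trend that dies with short channels tend to fail the leakage limit while dies with long channels tend to fail the timing limit. The function names are hypothetical.

```python
def combined_yield(lg_bins, leak_yield_given_lg, timing_yield_given_lg):
    """Sum over L_g bins of P(L_g) * Y_leak(.|L_g) * Y_perf(.|L_g)."""
    return sum(p * leak_yield_given_lg(l_g) * timing_yield_given_lg(l_g)
               for l_g, p in lg_bins)


def marginal_yield(lg_bins, yield_given_lg):
    """Yield for a single limit, averaged over the L_g bins."""
    return sum(p * yield_given_lg(l_g) for l_g, p in lg_bins)


# Hypothetical three-bin example: (L_g in units of sigma, probability).
bins = [(-1, 0.25), (0, 0.50), (1, 0.25)]
leak = {-1: 0.2, 0: 0.9, 1: 1.0}    # short channel: most dies fail on leakage
timing = {-1: 1.0, 0: 0.9, 1: 0.3}  # long channel: most dies fail on timing

y_combined = combined_yield(bins, leak.get, timing.get)
y_naive = marginal_yield(bins, leak.get) * marginal_yield(bins, timing.get)
print(f"bin-by-bin: {y_combined:.3f}, product of marginals: {y_naive:.3f}")
```

In this toy case the bin-by-bin result is lower than the product of the two marginal yields, because leakage and delay are inversely correlated through $L_g$ and are only treated as independent within a bin.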
LEAKAGE AND TIMING YIELD ANALYSIS
For the total power and leakage power, we report the arithmetic mean over 20 MCNC benchmarks within and among three FPGA architecture classes: Homo-Vt is the conventional FPGA using the same optimized Vt for both logic blocks and interconnect; Hetero-Vt optimizes Vt separately for logic blocks and interconnect; and Homo-Vt+G is the same as Homo-Vt except that unused logic blocks and interconnect are power-gated as studied in [10]. We assume 10% of the nominal value as 3σ for all the process variations. Figure 1 shows the full chip leakage power simulated by Monte Carlo simulation and σ, in the presence of inter-die and intra-die variations. Leakage may change significantly due to process variations. When there is a ±3σ inter-die variation of $L_{eff}$, the leakage power has a 3X span. When no inter-die variation is present, there is still a 2X span in leakage power due to within-die variation. Clearly, leakage is more sensitive to within-die variation than to inter-die variation. It is therefore important to consider the impact of process variations on leakage when determining the yield.
Leakage Yield
We further validate our chip-level analytical leakage model against Monte Carlo simulation of the full chip leakage power in Table 1, where the global variations are all set to ±3σ, and the local variations are set to 0, ±1σ, and ±2σ. The mean calculated by our analytical method differs from the simulation by less than 3%, and the standard deviations differ by 1% of the mean value. In the rest of the paper, we always report the standard deviation as a relative value with respect to the mean and use our analytical model to calculate the yield.
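The kind of comparison between an analytical mean and Monte Carlo simulation described above can be illustrated with a small self-contained experiment. This sketch does not reproduce Table 1. It assumes a per-element leakage of the form $I_n \cdot e^{-(c_1 L_l + c_2 V_l + c_3 T_l)}$ with independent zero-mean normal local variations, and the fitting coefficients, standard deviations, and sample count are arbitrary illustrative values.

```python
import math
import random


def analytic_mean_leakage(i_nominal, c, sigma_local):
    """Closed-form mean of i_nominal * exp(-sum(c_k * x_k)), x_k ~ N(0, sigma_k)."""
    return i_nominal * math.exp(0.5 * sum((ck * sk) ** 2 for ck, sk in zip(c, sigma_local)))


def monte_carlo_mean_leakage(i_nominal, c, sigma_local, n_elements, seed=0):
    """Average leakage over n_elements with independently sampled local variations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_elements):
        exponent = -sum(ck * rng.gauss(0.0, sk) for ck, sk in zip(c, sigma_local))
        total += i_nominal * math.exp(exponent)
    return total / n_elements


# Hypothetical coefficients and standard deviations, chosen only for illustration.
c = (30.0, 20.0, 10.0)
sigma_local = (0.01, 0.01, 0.01)
analytic = analytic_mean_leakage(1.0e-9, c, sigma_local)
simulated = monte_carlo_mean_leakage(1.0e-9, c, sigma_local, n_elements=100_000)
relative_error = abs(simulated - analytic) / analytic
print(f"analytic {analytic:.4e} A, Monte Carlo {simulated:.4e} A, error {relative_error:.2%}")
```

With many elements the sample average settles close to the closed-form mean, which is the property the chip-level model relies on when it replaces the sum over elements by its expected value.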
Impact of Architecture and Device Tuning
In this section, we consider combinations of device and architecture parameters, called hyper-architectures (in short, hyperarch). Table 2 shows the yield, mean leakage, and standard deviation for two different device settings, sorted by yield. Columns 1-4 use the ITRS device setting. Our baseline FPGA has N = 8 and K = 4, which is the architecture used by Xilinx Virtex-II Pro. Yield is calculated using the nominal leakage of each architecture plus an offset of 30% of the nominal leakage of the baseline architecture, $PL_{base}$, as the leakage limit. As shown in column 1 of Table 2, the yield ranges from 24% to 70%, which shows that architecture tuning has a significant impact on the yield. Among all architectures, N = 6 and K = 5 gives the maximum yield, which is 12% higher than the baseline. The yield is affected by both the mean and the variance of the leakage. When the mean leakage is close to the leakage limit, the variance becomes important in determining the yield. When the mean is not close to the limit, however, the variance has little impact on the yield; in this case, the lower the mean leakage, the higher the yield (see columns 5-8). It is also noticeable that larger LUT sizes have larger mean leakage and thus lower yield.

Device tuning also affects the yield. In columns 5-8 of Table 2, we use the device setting that provides the minimum energy-delay product (the minimum product of energy per clock cycle and critical path delay, in short, min-ED) given in [11]. Column 5 shows that optimizing Vdd and Vt can increase the yield rate of each architecture by an average of 39%. Device tuning therefore has a great impact on yield rate, and it is important to evaluate different Vdd and Vt levels while considering process variations. Comparing the yield of architecture (12, 7) in the ITRS device setting with that of architecture (6, 4) in the min-ED device setting shows that combining device tuning with architecture tuning can increase the yield by up to 73%.
From Table 2, architectures with K = 4 generally provide the highest yield rate, and they have the minimum area as reported in previous work such as [11]. In the rest of the paper, we only consider dominant architectures. Dominant architectures are defined as the group of architectures that have either a smaller delay or lower energy consumption than the others [11]. Figure 2 presents the energy-delay tradeoff among the dominant architectures assuming the Homo-Vt class.
Impact of Heterogeneous-Vt and Power-gating
It has been shown that heterogeneous-Vt and power-gating may have a great impact on the energy-delay tradeoff [11]. Here we further consider the impact of heterogeneous-Vt on the yield by comparing Homo-Vt and Hetero-Vt in the min-ED device setting. Table 3 shows the results for the dominant architectures in all classes. The average yield of each class is presented in the last row of the table. Comparing the yield of Homo-Vt and Hetero-Vt, we can see that the average yield is improved by 5% by applying different Vt to logic blocks and interconnect. Therefore, introducing heterogeneous-Vt can improve yield with little or no area increase (due to an increase in doping well area).
Furthermore, power-gating can be applied to unused FPGA logic blocks and interconnect to reduce leakage power. As only one sleep transistor is used per logic block, we use a 210X PMOS as the sleep transistor for each logic block. For interconnect, the area overhead associated with sleep transistors is more significant. We therefore use a 2X PMOS as the sleep transistor for each interconnect switch. Comparing the yield of Homo-Vt and Homo-Vt+G in Table 3, applying power-gating improves the yield by 8%. Comparing the yield of Hetero-Vt and Homo-Vt+G, power-gating obtains more yield improvement than heterogeneous-Vt, at the cost of a chip-level area overhead of 10% to 20%. As leakage power can be greatly reduced by power-gating, applying heterogeneous-Vt and power-gating simultaneously brings little additional benefit, and we do not present those results here. Again, with heterogeneous-Vt or power-gating, LUT size K = 4 is the best for leakage yield rate.
Timing Yield
For timing yield analysis, we only analyze the delay of the largest MCNC benchmark, clma. Similarly, in the literature on ASICs, timing yield is often studied using a selected test circuit such as a ring oscillator. Figure 3 shows the delay with intra-die and inter-die channel length variation for the baseline architecture (8, 4) with the ITRS device setting. As shown in the figure, there is a 1.9X span with ±3σ $L_g$ variation, and a 1.1X span without $L_g$ variation. Clearly, delay is more sensitive to inter-die variation than to within-die variation. This is because the local $L_{eff}$ variations of different elements are independent, so the effect of within-die $L_{eff}$ variation tends to average out when the critical path is long enough.
For timing yield, we discard dies with a critical path delay larger than the cutoff delay, which is 1.1X of the nominal critical path delay of each architecture. Table 4 shows the delay yield of Homo-Vt+G.
One can see from this table that a larger LUT size gives a higher yield rate. This is because a larger LUT size generally gives a smaller mean delay with a shorter critical path (see Figure 2), i.e., a smaller number of elements in the path, which leads to a smaller variance. Therefore, a larger LUT size leads to a higher timing yield. As the timing specification may be relaxed for applications that are not timing-critical, the cutoff delay may also be relaxed in such cases.
Leakage and Timing Combined Yield
Figure 4 presents the leakage and delay variation for the baseline case using Monte Carlo simulation with Ptrace. It can be seen that a smaller delay generally leads to a larger leakage. This is because of the inverse correlation between circuit delay and leakage: a device with a short channel length has a small delay but a large leakage current. To calculate the leakage and delay combined yield, we set the cutoff leakage as the nominal leakage plus 30% of that of the baseline, while the cutoff delay is 1.2X of each architecture's nominal delay. Table 5 presents the combined yield for Homo-Vt with the ITRS device setting and for all classes with the min-ED device setting. The area overhead introduced by power-gating is also presented in the table.
Comparing Homo-Vt with the ITRS device setting and with the min-ED device setting, the combined yield is improved by 21%. Comparing the classes using the min-ED device setting, Hetero-Vt has a 3% higher yield than Homo-Vt due to heterogeneous-Vt, while Homo-Vt+G has an 8% higher yield than Homo-Vt due to power-gating. Homo-Vt+G has the highest combined yield, with an average area overhead of 16%. Device tuning and power-gating together improve yield by 29%, comparing Homo-Vt+G with the min-ED setting to Homo-Vt with the ITRS setting. The table also shows that architectures with LUT size 5 give the highest yield within each class, because they have both a relatively high leakage yield and a relatively high timing yield.
CONCLUSIONS AND DISCUSSIONS
In this paper, we have developed efficient models for chip-level leakage variation and system timing variation in FPGAs. Experiments show that our models are within 3% of Monte Carlo simulation, and that the leakage and delay variations can be up to 3X and 1.9X, respectively. In addition, leakage is more sensitive to within-die variations than to die-to-die variations, but timing is more sensitive to die-to-die variations. We have shown that architecture and device tuning has a significant impact on FPGA parametric yield rate. LUT size 4 has the highest leakage yield, 7
