Abstract-Carbon nanotube field-effect transistors (CNFETs) are promising candidates for building energy-efficient digital systems at highly scaled technology nodes. However, carbon nanotubes (CNTs) are inherently subject to variations that reduce circuit yield, increase susceptibility to noise, and severely degrade their anticipated energy and speed benefits. Joint exploration and optimization of CNT processing options and CNFET circuit design are required to overcome this outstanding challenge. Unfortunately, existing approaches for such exploration and optimization are computationally expensive, and mostly rely on trial-and-error-based ad hoc techniques. In this paper, we present a framework that quickly evaluates the impact of CNT variations on circuit delay and noise margin, and systematically explores the large space of CNT processing options to derive optimized CNT processing and CNFET circuit design guidelines. We demonstrate that our framework: 1) runs over 100× faster than existing approaches and 2) accurately identifies the most important CNT processing parameters, together with CNFET circuit design parameters (e.g., for CNFET sizing and standard cell layouts), to minimize the impact of CNT variations on CNFET circuit speed with ≤5% energy cost, while simultaneously meeting circuit-level noise margin and yield constraints.
I. INTRODUCTION
WHILE physical scaling of silicon-based field-effect transistors has improved digital system performance for decades [10], continued device scaling is becoming increasingly challenging [2]. Carbon nanotube (CNT) field-effect transistors (CNFETs) are excellent candidates for continuing to improve both performance and energy efficiency of digital systems [13]. CNFET-based very large-scale integrated (VLSI) digital systems are projected to improve energy-delay product (EDP) by an order of magnitude versus silicon-CMOS [6], [46]. Furthermore, CNFETs provide an exciting opportunity to enable monolithic 3-D integrated circuits [47], leading to additional EDP benefits for CNFET-based digital systems with massive integration of logic and memory [42].
The schematic of a CNFET is shown in Fig. 1 . Multiple CNTs compose the transistor channel, whose conductance is modulated by the gate. The gate, source, and drain are defined using traditional photolithography, while the CNT-CNT spacing is determined by the CNT growth [31] and can therefore exceed the minimum lithographic pitch. For high drive current, the target CNT-CNT spacing is 4-5 nm [46] .
Despite demonstrations of sub-10 nm channel length CNFETs [13] and stand-alone CNFET circuit elements [5], [7], [11], realization of complex CNFET-based digital systems had been prohibited by substantial imperfections inherent to CNTs: mis-positioned CNTs and metallic CNTs. Mis-positioned CNTs cause stray conducting paths that can lead to incorrect logic functionality, and metallic CNTs (resulting from the imprecise control over CNT properties) result in increased leakage current and can lead to incorrect logic functionality. A unique combination of CNT processing and CNFET circuit design techniques, known as the imperfection-immune paradigm [54], overcomes these challenges in a VLSI-compatible manner to enable the realization of the first CNFET-based digital systems [32], [33], [40], including the first programmable microprocessor built using CNFETs [39]. Two key enablers of these demonstrations are: 1) mis-positioned CNT-immune layout design [30] and 2) VLSI-compatible metallic CNT removal (VMR), which efficiently removes ≥99.99% of metallic CNTs [32], [40].
Unfortunately, process variations specific to CNTs, such as the imprecise control over CNT properties and the nonuniform density of grown CNTs (details in Section II), can lead to significantly reduced circuit yield, increased susceptibility to noise, and large variations in CNFET circuit delays (Section II) [54] . One method to counteract these effects is to upsize all CNFETs. However, such naïve upsizing incurs large energy and delay costs that diminish CNFET technology benefits.
Rather, various CNT process improvement options, when combined with CNFET circuit design, provide an energy-efficient method of overcoming CNT variations. Without such strategies, CNT variations can degrade the potential speed benefits of CNFET circuits by ≥20% at sub-10 nm nodes, even for circuits with upsized CNFETs to achieve ≥99.9% yield (Section II). By leveraging CNT process improvements, together with CNFET circuit design, the overall speed degradation can be limited to ≤5% with ≤5% energy cost while simultaneously meeting circuit-level noise margin and yield constraints [52].
0278-0070 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
However, co-optimization of CNT technology options and CNFET circuit design parameters using trial-and-error-based search can be prohibitively time-consuming. In this paper, we demonstrate a systematic and VLSI-scalable methodology that selects effective combinations of CNT processing options and CNFET circuit design techniques to overcome CNT variations. Our key contributions are as follows.
1) Techniques to quickly evaluate the impact of CNT variations on circuit yield, susceptibility to noise, delay, and energy. They run >100× faster than previous approaches.
2) A systematic methodology to explore the large space of CNT processing options together with CNFET circuit design parameters (e.g., CNFET sizing and standard cell layouts leveraging CNT correlation, see Section II), to rapidly identify designs that reduce the impact of CNT variations on circuit yield, susceptibility to noise, and delay variations with ≤5% energy cost. This is in sharp contrast to previous trial-and-error-based approaches.
3) Derivation of guidelines for CNT processing and CNFET circuit design parameters at highly scaled technology nodes to overcome CNT variations. We provide guidelines to limit the overall circuit speed degradation to ≤5% with ≤5% energy cost while maintaining ≥99.999% functional circuit yield and ≤0.001% probability of failing to meet circuit-level noise margin requirements (Section IV).
In Section II, we present an overview of CNT variations and their impact on CNFET circuits. Section III describes a methodology to optimize circuit performance in the presence of CNT variations, leveraging a SPICE-compatible CNFET device model to build efficient variation-aware models for the delay, energy, and noise margin of CNFET circuits. Using this methodology, we provide CNT processing and CNFET circuit design guidelines for overcoming CNT variations at the 14, 10, 7, and 5 nm technology nodes (Section IV).
An earlier version of this paper was published in [16] . Here, we present the following additional contributions.
1) Design and analysis of CNFET digital VLSI circuits scaled to the 5 nm node, enabled by a recently developed SPICE-compatible CNFET device model for accurate analysis of sub-10 nm gate length CNFETs [23].
2) A computationally efficient technique to numerically calculate the probability that CNFET circuits fail to meet circuit-level noise margin requirements. This technique can accurately compute such probabilities even when they are below 0.001% (as is desirable for VLSI-scale circuits, details in Section II-C).
In this paper, we make references to [17], which contains additional figures and analysis details. It is available for download at http://www.arxiv.org.
Figure caption (fragment): IDC = 0.50 [49] for CNT density variations, p m = 33% [37] and p Rs = 4% [40] for m-CNT-induced variations. Diameter is normally distributed with µ d = 1.3 nm and σ d = 0.1 nm [31]. Alignment and doping distribution details in [54]. To analyze I ON variations attributed to individual sources of CNT variations, all other sources of CNT variations are removed. Additional parameters in [17].
II. CNT VARIATIONS
In addition to process variations that exist for silicon-CMOS FETs (e.g., variations in channel length, oxide thickness, and threshold voltage [26]), CNFETs are also subject to CNT-specific variations, including variations in CNT type (semiconducting: s-CNT or metallic: m-CNT) [32], CNT density [49], diameter [34], alignment [30], and doping [9] (details in [17, Sec. VI]). While the on-current (I ON) of a CNFET with only a single CNT as its channel is highly sensitive to CNT diameter variations [34], CNFETs in practical VLSI circuits consist of multiple CNTs to provide sufficient I ON. Thus, the impact of diameter variations is reduced due to statistical averaging (Fig. 2) [35]. Rather, I ON variations are dominated by variations in the CNT count: the number of s-CNTs in a CNFET (after m-CNT removal, e.g., using VMR)¹ [52]. CNT count variations stem from two sources.
1) CNT Density Variations: Precise positioning of CNTs is difficult to control; the resulting CNT-CNT spacing variations lead to a variable number of CNTs in each CNFET [49].
2) m-CNT-Induced Variations: Each CNFET contains a variable number of both s-CNTs and m-CNTs, resulting in CNT count variations even assuming a perfectly selective m-CNT removal technique (i.e., p Rm = 100%, p Rs = 0%: Table I). In addition, m-CNT removal techniques may inadvertently remove a small fraction of s-CNTs, further contributing to CNT count variations [54].
¹ Another technique for post-growth m-CNT removal is known as CNT sorting, in which s-CNTs are separated from m-CNTs in a solution [1]. However, CNT sorting techniques have not yet achieved the selectivity required for VLSI-scale digital circuits [50].
CNT count variations are characterized by four processing parameters, defined in Table I: the Index of Dispersion for CNT count (IDC), p m, p Rs, and p Rm. We analyze the impact of CNT count variations on CNFET circuit modules synthesized from the processor core of OpenSPARC T2, a large multicore chip that closely resembles the commercial Oracle/SUN Niagara 2 system [27]. These OpenSPARC modules consist of ∼4 K to >100 K logic gates (Table III) and expose several effects in VLSI-scale circuits (e.g., wire parasitics) that are not visible in small circuit benchmarks. We consider the effects of CNT count variations on the following circuit-level metrics.
1) Functional Yield: Due to CNT count variations, there is nonzero probability that a CNFET contains no s-CNTs in its channel, leading to functional failure of the CNFET (i.e., CNT count failure) [51]. The count-limited yield of a CNFET circuit is the probability that no CNFET experiences CNT count failure [51] (Section II-A).
2) Delay Penalty: The increase in the 95-percentile delay (T 95: the minimum clock period that the circuit has a 95% probability of meeting) relative to the nominal delay (the critical path delay when there are no variations). Details in Section II-B.
3) Static Noise Margin (SNM): A measure of the noise susceptibility of a pair of connected logic gates (Section II-C).
4) Probability of Noise Margin Violation (PNMV): The probability that any pair of connected logic gates in a circuit fails to meet SNM R, a required SNM level (Section II-C).
A. Impact on Circuit Functional Yield
For VLSI CNFET circuits with minimum-width CNFETs, the count-limited yield can be very low (near zero) [51]. An effective method to significantly improve the count-limited yield (≥99.999%) is to perform minimum-width upsizing: upsize all CNFETs that have width (W) less than a specified minimum width (W MIN) to have W = W MIN [51]. Although minimum-width upsizing effectively improves count-limited yield, it can incur large energy costs if the CNT count failures of all CNFETs are independent [51]. Rather, for CNFET circuits with highly aligned CNTs, the count-limited yield (and the energy cost of minimum-width upsizing, details below) can be significantly improved by leveraging the unique property of CNT correlation: since CNTs are 1-D nanostructures with lengths typically much longer than the CNFET contacted gate pitch [20], [31], the CNT counts of CNFETs can be uncorrelated or highly correlated depending on the relative physical placement of the CNFET active regions (active region: area of channel which has CNTs) [51]. Special aligned-active layouts can engineer these correlations by aligning the active regions in a library to maximize correlation [17, Fig. 15]. Aligned-active layouts incur minimal area increase (only 4 of 134 cells from the Nangate 45 nm Open Cell Library [25] incur area penalties <14%), and the locations of I/O pins are mostly retained, resulting in negligible impact on intercell routing [51].
Figure caption: Energy cost of minimum-width upsizing with aligned-active layouts to achieve ≥99.9% count-limited yield: OpenSPARC modules, IDC = 0.50, p m = 10%, p Rs = 4%, p Rm = 99.99% (count-limited yield improves to ≥99.999% with the processing guidelines in Section IV). Improving delay penalty and PNMV can require additional energy costs.
To achieve count-limited yield ≥99% for circuits today (which can consist of 100M logic gates), the count-limited yield for each OpenSPARC module (∼100K logic gates) should be ≥99.999%. To reach this target, we use a combination of minimum-width upsizing, aligned-active layouts, and CNT process improvements. We first use minimum-width upsizing with aligned-active layouts to achieve count-limited yield ≥99.9% (which is lower than the 99.999% requirement, details below). Then, CNT process improvements (which are required to meet delay penalty and noise margin requirements) further improve the count-limited yield, e.g., to ≥99.999% (details below in steps 1-3). We define E FUNC as the energy cost (in terms of total energy per cycle) of minimum-width upsizing to reach a desired count-limited yield (i.e., functional yield). E FUNC can be ≤2.5% for all the OpenSPARC modules (Fig. 4). It is determined using the design flow in Fig. 3. Steps 1-3 (Fig. 3)
While (E NomOpt , T NomOpt ) represents an attractive design in the nominal case (since EDP NomOpt is small versus other points on the energy-delay tradeoff curve), this design may have a high delay penalty due to CNT variations (e.g., it can be ≥20% at sub-10 nm nodes: Section II-B).
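The module-level yield target in this section follows from simple arithmetic, which the sketch below illustrates. It assumes (our simplification, not stated in the text) that a chip is tiled from identical modules whose count-limited yields are statistically independent; the function name `chip_yield` is hypothetical.

```python
def chip_yield(module_yield: float, gates_per_module: int, gates_per_chip: int) -> float:
    """Chip-level yield, ASSUMING the chip is built from identical modules
    whose count-limited yields are independent (illustrative simplification)."""
    n_modules = gates_per_chip / gates_per_module
    return module_yield ** n_modules


if __name__ == "__main__":
    # ~100K-gate modules in a ~100M-gate chip: 1000 modules.
    for y_module in (0.999, 0.99999):
        y_chip = chip_yield(y_module, 100_000, 100_000_000)
        print(f"module yield {y_module:.5f} -> chip yield {y_chip:.4f}")
```

Under these assumptions, a module yield of 99.999% keeps the chip-level yield just above 99%, whereas 99.9% per module would leave the chip-level yield far below the target.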
B. Impact on Circuit Delay Variations
To derive distributions of CNFET circuit delays resulting from CNT count variations, we leverage the methodology described in [52] . This is a Monte Carlo statistical static timing analysis (MC SSTA) approach with two key changes: 1) a variation-aware timing model for CNFET logic gates (built using a CNFET device model [45] ) and 2) highly efficient CNT count sampling, based on the unique asymmetric CNT correlation property (Section II-A). This allows us to compute the delay penalty for each OpenSPARC module (after steps 1-3 in Fig. 3 ) as follows: sample the delay distribution via MC SSTA (using 2000 trials, excluding any trials that have CNT count failure), then extract T 95 from the delay distribution to calculate the delay penalty [ Fig. 5(a) ]. Fig. 5(a) illustrates that the delay penalty for the OpenSPARC modules can be ≥20% for EDP-optimized designs with aligned-active layouts at highly scaled technology nodes.
To overcome CNT variations, we target delay penalty ≤5% with total energy per cycle cost E ≤ 5% [relative to E NomOpt (1)] to maintain ≥90% of the projected EDP benefits of CNFET circuits, even in the presence of CNT variations. To improve delay penalties we leverage the selective upsizing approach described in Section II-A [52] . Fig. 5(b) shows that selective upsizing can reduce delay penalties by 1.5× (e.g., from 17% to 11% for the "gkt" OpenSPARC module); in Fig. 5(b) , additional selective upsizing was performed after steps 1-3 in Fig. 3 by increasing k SelUpsize to minimize the delay penalty subject to E ≤ 5%.
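As a minimal sketch of how the delay penalty is extracted from Monte Carlo samples, the code below takes a list of sampled circuit delays and a nominal delay and returns (T95 − T_nominal)/T_nominal. The empirical-percentile convention (smallest sample that at least 95% of samples do not exceed) and the function names are assumptions made for illustration.

```python
def percentile_delay(delays, percent=95):
    """Smallest sampled delay that at least `percent`% of the samples do not exceed."""
    ordered = sorted(delays)
    # Integer ceiling of percent * n / 100, converted to a 0-based index.
    k = (percent * len(ordered) + 99) // 100 - 1
    return ordered[max(k, 0)]


def delay_penalty(delays, t_nominal, percent=95):
    """Relative increase of the percentile delay over the nominal delay."""
    return (percentile_delay(delays, percent) - t_nominal) / t_nominal


if __name__ == "__main__":
    samples = [float(d) for d in range(1, 101)]  # toy delay samples 1..100
    print(percentile_delay(samples))             # T95 of the toy samples
    print(delay_penalty(samples, t_nominal=80.0))
```

In the flow described above, `delays` would hold one critical-path delay per MC SSTA trial (2000 trials, excluding trials with CNT count failure).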
C. Impact on Circuit PNMV
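A common metric to quantify the noise susceptibility of a pair of connected logic gates [i.e., a gate pair: (G (dr), G (ld)), where G (dr) and G (ld) are the driving and loading logic gates, respectively] is the SNM, which can be quantified as follows using the gate pair shown in Fig. 6. Let $V_{OH}^{(dr)}$ and $V_{OL}^{(dr)}$ be the output high and low levels of G (dr), and let $V_{IH}^{(ld)}$ and $V_{IL}^{(ld)}$ be the input high and low thresholds of G (ld), obtained from the voltage transfer curves (VTCs) of the two gates [the VTC of G (ld) is mirrored in Fig. 6(b)]. Then for the gate pair (G (dr), G (ld)), the high SNM (SNMH), the low SNM (SNML), and the SNM are defined in (2)-(4), respectively [48]

$$\mathrm{SNMH}\big(G^{(dr)}, G^{(ld)}\big) = V_{OH}^{(dr)} - V_{IH}^{(ld)} \tag{2}$$

$$\mathrm{SNML}\big(G^{(dr)}, G^{(ld)}\big) = V_{IL}^{(ld)} - V_{OL}^{(dr)} \tag{3}$$

$$\mathrm{SNM}\big(G^{(dr)}, G^{(ld)}\big) = \min\Big(\mathrm{SNMH}\big(G^{(dr)}, G^{(ld)}\big),\ \mathrm{SNML}\big(G^{(dr)}, G^{(ld)}\big)\Big). \tag{4}$$

SNM(G (dr), G (ld)) is sensitive to I ON variations [48], and so it is sensitive to CNT count variations. To quantify the impact of SNM variations on circuit noise susceptibility, we use the PNMV metric, which is the probability that any gate pair in a circuit fails to meet a required SNM level: SNM R. SNM R is a design constraint chosen by the designer, and PNMV increases with SNM R: a larger SNM R (tighter SNM requirement) gives a higher PNMV (lower probability of meeting the SNM requirement). Typical values of SNM R are relative to the supply voltage, V DD (e.g., SNM R = V DD /5 [48]). PNMV is defined in (5), where C is the set of all gate pairs

$$\mathrm{PNMV} = 1 - P\Big\{\mathrm{SNM}\big(G^{(dr)}, G^{(ld)}\big) \ge \mathrm{SNM}_R \ \ \forall\, \big(G^{(dr)}, G^{(ld)}\big) \in C\Big\}. \tag{5}$$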
To solve for PNMV due to CNT count variations, we leverage a variation-aware SNM model that can compute SNMH(G (dr), G (ld)) and SNML(G (dr), G (ld)) for every gate pair in a circuit, given the CNT counts of each CNFET contained in G (dr) and G (ld) (details in Section III-B1). In Section III-B2, we describe how to combine this variation-aware SNM model and the distributions of CNT count for all CNFETs in the circuit to efficiently calculate PNMV. After steps 1-3 in Fig. 3, PNMV can be nearly 100% at the 5 nm node. To achieve PNMV ≤ 1% for circuits today (with ∼100M logic gates), each OpenSPARC module (∼100K logic gates) should have PNMV ≤ 0.001%.
Since minimum-width CNFETs are highly sensitive to CNT count variations [52], gate pairs that contain minimum-width CNFETs are highly likely to cause SNM violations. Thus, PNMV is highly sensitive to minimum-width CNFETs, so further minimum-width upsizing (in addition to minimum-width upsizing for count-limited yield) improves PNMV (via statistical averaging) at the cost of energy [Fig. 7(c)]. However, additional minimum-width upsizing may be undesirable as it can require E > 5%, can increase circuit delay, and is not guaranteed to meet PNMV constraints [17, Sec. IX-A].
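The SNM check for a single gate pair can be sketched as follows. The sketch assumes the conventional noise margin definitions, SNMH = V_OH(dr) − V_IH(ld) and SNML = V_IL(ld) − V_OL(dr), with SNM taken as their minimum, and SNM_R expressed as a fraction of V_DD (V_DD/5 by default). The function names and the voltage values in the example are hypothetical.

```python
def snm(v_oh_dr, v_ol_dr, v_ih_ld, v_il_ld):
    """Return (SNMH, SNML, SNM) for a gate pair, using the conventional
    definitions assumed in the lead-in (all voltages in volts)."""
    snmh = v_oh_dr - v_ih_ld   # margin for a logic-high signal
    snml = v_il_ld - v_ol_dr   # margin for a logic-low signal
    return snmh, snml, min(snmh, snml)


def violates_snm(v_oh_dr, v_ol_dr, v_ih_ld, v_il_ld, vdd, fraction=0.2):
    """True if the gate pair fails to meet SNM_R = fraction * V_DD."""
    return snm(v_oh_dr, v_ol_dr, v_ih_ld, v_il_ld)[2] < fraction * vdd


if __name__ == "__main__":
    vdd = 0.5  # illustrative supply voltage
    print(violates_snm(0.48, 0.02, 0.30, 0.20, vdd))  # balanced VTC
    print(violates_snm(0.48, 0.02, 0.40, 0.20, vdd))  # shifted V_IH
```

A circuit-level violation occurs if this check fails for any gate pair, which is why PNMV grows with the number of gate pairs and with SNM_R.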
D. Overcoming CNT Variations
As shown above, CNFET upsizing techniques alone can be insufficient to meet design goals (e.g., delay penalty ≤5% and PNMV ≤ 0.001% with E ≤ 5%) [54] . Rather, a combination of CNT processing and CNFET circuit design is required [54] , but two key questions must be answered: 1) which processing parameters to improve? 2) By how much?
Without a systematic methodology to evaluate the circuit-level impact of CNT variations, one might blindly pursue difficult CNT processing paths with diminishing returns, while overlooking other processing parameters that enable larger performance gains. For example, much research has focused solely on improving p m [1]. However, reducing p m past 1% suffers from diminishing returns and can be insufficient to meet design goals [16], [54] (e.g., in Fig. 16 in [17, Sec. VII]: p m = 0.1% does not achieve delay penalty ≤5%).
Previously, co-optimization of processing and design has been performed via a trial-and-error-based approach [52]. However, this can be prohibitively time-consuming, potentially requiring months of simulation time (details in Section IV). In Section III, we present a methodology that efficiently selects effective combinations of CNT processing options and CNFET circuit design techniques to overcome CNT variations.
Fig. 8. Illustration of two rows of standard cells that depicts the relationship between the sampling region CNT counts (e.g., n 1, n 2, ..., n 12) and the CNT counts of each CNFET [53]. For example, the CNT count for CNFET P3,1 in inverter U3 is n 1 + n 2 + n 3 = 2 + 3 + 3 = 8.
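The relationship between sampling region CNT counts and CNFET CNT counts can be sketched as a simple sum over overlapped regions. In the code below, the entry for P3,1 reproduces the example above (n1 + n2 + n3 = 2 + 3 + 3 = 8); the second CNFET, labeled "P_aligned", and its region list are hypothetical and only illustrate how aligned active regions share counts.

```python
def cnfet_cnt_counts(region_counts, overlaps):
    """CNT count of each CNFET = sum of the counts of the sampling
    regions that its active region overlaps."""
    return {
        cnfet: sum(region_counts[r] for r in regions)
        for cnfet, regions in overlaps.items()
    }


if __name__ == "__main__":
    region_counts = {1: 2, 2: 3, 3: 3}     # n1, n2, n3 from the example
    overlaps = {
        "P3,1": [1, 2, 3],                 # from the caption example
        "P_aligned": [1, 2, 3],            # hypothetical aligned CNFET
    }
    print(cnfet_cnt_counts(region_counts, overlaps))
```

Because both CNFETs draw on the same independent region counts, their CNT counts are fully correlated in this toy case, while CNFETs over disjoint regions would be independent.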
III. RAPID CO-OPTIMIZATION OF PROCESSING & DESIGN
An existing approach to overcome CNT variations is based on brute-force trial-and-error [52] : a designer iterates over many design points (design point: a combination of values for the CNT processing parameters: IDC, p m , p Rs , p Rm , and the CNFET design parameters: e.g., k SelUpsize ), analyzing each one until a design point that satisfies a target delay penalty and target PNMV with small energy cost is found. Furthermore, this approach utilizes highly accurate yet computationally expensive models to calculate delay penalties and PNMV. It suffers from two significant bottlenecks.
1) The time required to calculate delay penalties and PNMV limits the number of design points that can be explored.
2) The number of required simulations can be exponential in the number of CNT processing and CNFET design parameters.
Our methodology overcomes these bottlenecks as follows.
1) We estimate delay penalties >100× faster than the previous approach and efficiently calculate PNMV ≤ 0.001%, enabling exploration of many more design points while maintaining sufficient accuracy to make correct design decisions (details in Section IV).
2) We use a gradient descent search algorithm, based on delay and PNMV sensitivity information with respect to the processing parameters, to systematically guide the exploration of design points (details in Section III-D).
A. Rapid Quantification of Circuit Delay Penalty
To quantify CNFET circuit delay variations, we leverage the probabilistic framework in [52] , which is based on an MC SSTA approach with two key enhancements.
1) Highly Efficient Sampling Method: It is not trivial to analytically model the effects of CNT correlation at the circuit level. We partition the circuit area into sampling regions, each of which has its own independent CNT count. The CNT count of each CNFET is then the sum of the CNT counts of the sampling regions that it overlaps (example shown in Fig. 8) [53].
2) Variation-Aware Timing Model: The drive current and parasitic capacitances of CNFETs are modeled as affine functions of the CNT counts in each sampling region [53]. [...] [24].
Unlike in [53], we fix the input slew rate of each logic gate to its nominal value. This allows us to efficiently compute all of the logic gate delays in a circuit simultaneously. These approximations have minimal impact on our design choices (Section IV). We refer to the model in [53] as the nonlinear timing model, and to the model described below as the linearized timing model.
To formulate the delay model for the full circuit, let $\mu_R$ and $\sigma_R$ be the mean and standard deviation of the sampling region CNT count distribution ($\mu_R$ and $\sigma_R$ are functions of the processing parameters shown in Table I). The first step to estimate the delay penalty of a design point is to sample the CNT count for each sampling region and for each MC trial. Each sample is one entry in a matrix $N \in \mathbb{R}^{r \times n}$, where r is the total number of sampling regions and n is the total number of MC trials. We then compute the total capacitive load and drive current for each of the m gates (for each trial) via an affine transformation of the region CNT counts (based on the model in [53]). We express this transformation in matrix form, where $C_{Tot}, I_{Drive} \in \mathbb{R}^{m \times n}$

$$C_{Tot} = A_{CLoad} N + b_{CLoad} \mathbf{1}^T \tag{6}$$

$$I_{Drive} = A_{IDrive} N + b_{IDrive} \mathbf{1}^T. \tag{7}$$

Our delay models are fully specified by $A_{CLoad}, A_{IDrive} \in \mathbb{R}^{m \times r}$ and column vectors $b_{CLoad}, b_{IDrive} \in \mathbb{R}^{m}$, which contain the coefficients of the affine transformations from the sampling region CNT counts to the CNFET drive currents and parasitic capacitances [53]. Next, we factor out $\mu_R$ and $\sigma_R$, a crucial step in achieving computational efficiency. We rewrite $N = \mu_R \mathbf{1}\mathbf{1}^T + \sigma_R X$, where each element of $X \in \mathbb{R}^{r \times n}$ is distributed according to a unit Gaussian distribution, allowing (6)-(7) to be written as

$$C_{Tot} = \mu_R A_{CLoad} \mathbf{1}\mathbf{1}^T + \sigma_R A_{CLoad} X + b_{CLoad} \mathbf{1}^T \tag{8}$$

$$I_{Drive} = \mu_R A_{IDrive} \mathbf{1}\mathbf{1}^T + \sigma_R A_{IDrive} X + b_{IDrive} \mathbf{1}^T. \tag{9}$$

Note that $\mathbf{1}$ is a column vector with every element equal to 1 (its length is set by context), and multiplication of a matrix by a scalar (e.g., $\mu_R$ or $\sigma_R$) indicates that each element in the matrix is multiplied by that scalar. Any product that does not contain $\mu_R$ or $\sigma_R$ is independent of the processing parameters, and can therefore be precomputed. The dominant computational tasks are the matrix multiplications AX {which are O(mn) since A is sparse [12]}. Precomputing such terms (and factoring in the multiplication of $C_{Tot}$ and $V_{DD}$) yields equivalent expressions for total charge and drive currents that require only scalar operations (see Table II for variable definitions)

$$Q = V_{DD} C_{Tot} = \sigma_R Q_{MC} + \left(\mu_R q_{Exp} + q_{Fix}\right) \mathbf{1}^T \tag{10}$$

$$I_{Drive} = \sigma_R I_{MC} + \left(\mu_R i_{Exp} + i_{Fix}\right) \mathbf{1}^T. \tag{11}$$

Matching terms with (8)-(9) gives $Q_{MC} = V_{DD} A_{CLoad} X$, $q_{Exp} = V_{DD} A_{CLoad} \mathbf{1}$, $q_{Fix} = V_{DD} b_{CLoad}$, $I_{MC} = A_{IDrive} X$, $i_{Exp} = A_{IDrive} \mathbf{1}$, and $i_{Fix} = b_{IDrive}$. Precomputing $Q_{MC}$, $q_{Exp}$, $q_{Fix}$, $I_{MC}$, $i_{Exp}$, and $i_{Fix}$ (Table II) subsequently allows each logic gate delay to be efficiently computed with only two multiplications, one division, and three additions per trial [only counting operations in (12) that must be computed for each trial]. This includes the addition of $d_{Fix} \in \mathbb{R}^{m}$, a vector of fixed delays (e.g., input delays from external circuits). The matrix division in (12) is element-wise

$$D = Q \oslash I_{Drive} + d_{Fix} \mathbf{1}^T. \tag{12}$$

We then perform static timing analysis (STA) for each MC trial (and for the nominal case), and use the results to estimate T 95 and the delay penalty. The total circuit energy is computed using a model of the form E = (1/2)CV^2 [48].
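The precomputation idea behind the linearized timing model can be illustrated with a small sketch. It assumes that region counts are written as N = µ_R + σ_R·X with X unit Gaussian, that charge and drive current are affine in N, and that gate delay is charge divided by drive current plus a fixed delay; the dictionary keys mirror the names Q_MC, q_Exp, q_Fix, I_MC, i_Exp, and i_Fix, but their exact definitions here are our reading, not a transcription of Table II. Dense lists are used for brevity, whereas the text relies on sparse matrices.

```python
import random


def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def precompute(a_cload, b_cload, a_idrive, b_idrive, x, vdd):
    """Terms that do not depend on (mu_R, sigma_R); computed once."""
    return {
        "Q_MC": [[vdd * v for v in row] for row in matmul(a_cload, x)],
        "q_Exp": [vdd * sum(row) for row in a_cload],   # V_DD * A_CLoad * 1
        "q_Fix": [vdd * b for b in b_cload],
        "I_MC": matmul(a_idrive, x),
        "i_Exp": [sum(row) for row in a_idrive],        # A_IDrive * 1
        "i_Fix": list(b_idrive),
    }


def gate_delays(pre, mu_r, sigma_r, d_fix):
    """Per-gate, per-trial delays for one design point (mu_R, sigma_R)."""
    delays = []
    for g, d0 in enumerate(d_fix):
        q_off = mu_r * pre["q_Exp"][g] + pre["q_Fix"][g]  # once per gate
        i_off = mu_r * pre["i_Exp"][g] + pre["i_Fix"][g]  # once per gate
        delays.append([
            (sigma_r * q + q_off) / (sigma_r * i + i_off) + d0
            for q, i in zip(pre["Q_MC"][g], pre["I_MC"][g])
        ])
    return delays


def gate_delays_direct(a_cload, b_cload, a_idrive, b_idrive, x, vdd,
                       mu_r, sigma_r, d_fix):
    """Reference path: build N explicitly, then compute Q / I + d_Fix."""
    n_mat = [[mu_r + sigma_r * v for v in row] for row in x]
    c_tot = matmul(a_cload, n_mat)
    i_drv = matmul(a_idrive, n_mat)
    return [
        [vdd * (c + bc) / (i + bi) + d0 for c, i in zip(c_row, i_row)]
        for c_row, i_row, bc, bi, d0 in zip(c_tot, i_drv, b_cload, b_idrive, d_fix)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]  # r=3, n=4
    a_c, b_c = [[1.0, 0.0, 0.5], [0.0, 2.0, 1.0]], [0.2, 0.3]        # toy values
    a_i, b_i = [[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]], [0.0, 0.1]        # toy values
    pre = precompute(a_c, b_c, a_i, b_i, x, vdd=0.5)
    print(gate_delays(pre, mu_r=10.0, sigma_r=1.0, d_fix=[0.0, 0.05]))
```

The point of the split is that `precompute` runs once per circuit, while `gate_delays` can be re-evaluated cheaply for every candidate (µ_R, σ_R) during design-space exploration.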
B. Rapid Quantification of Circuit PNMV
Our method of analyzing circuit PNMV consists of two key components, each of which is described in this section.
1) A variation-aware SNM model, which computes V OH, V IH, V IL, and V OL (these terms are defined in Section II-C) as functions of the CNT counts of the CNFETs within a logic gate. This model can be used to compute SNM for every gate pair in the circuit.
2) A method to numerically calculate low PNMV values (e.g., ≤0.001%), given the variation-aware SNM model and given a network of cascaded logic gates (e.g., a circuit module after steps 1-3 in Fig. 3). This technique accounts for correlations in CNT count among CNFETs.
1) Variation-Aware Static Noise Margin Model:
We refer to V OH , V IH , V IL , and V OL as the VTC parameters, and we model them for each stage of cascaded logic. We distinguish logic stages from standard cells since a standard cell can consist of multiple logic stages (e.g., the standard cell BUF_X1 consists of two cascaded inverters, each of which is one logic stage). For standard cells with multiple logic stages, we model the VTC parameters separately for each logic stage (e.g., we consider the cross-coupled inverters in a D-latch as two separate logic stages). For consistency with the terminology in Section II-C (and without loss of generality), we assume that G (dr) and G (ld) in a gate pair each represent a single logic stage. We also define the state of a logic stage input or output as its logic value (0 or 1). A logic stage input is sensitized if the logic stage output depends on the state of that input (given the logic values of all the other inputs).
For each logic stage input in our standard cell library, we model the VTC parameters for every case in which that input is sensitized (considering all possible combinations of the other inputs). The VTC parameters are functions of the CNT counts of the p- and n-type CNFETs (there is a CNT count variable for each CNFET in the circuit) which: 1) are gated by that input and 2) connect the logic stage output to either V DD or ground through a series of CNFETs in the "on" state (see [17, Sec. IX-B] for an example). We define n P as the sum of the CNT counts of all such p-type CNFETs. We similarly define n N for the n-type CNFETs. For example, inverter U3 in Fig. 8 consists of a p-type CNFET (labeled "P3,1") and an n-type CNFET ("N3,1"). The CNT counts of P3,1 and N3,1 are n (P3,1) P and n (N3,1) N, respectively.
For the NAND2 gate U4 in Fig. 8 (as an example of a logic stage with multiple inputs), we separately model the VTC parameters for each input: in1 and in2. Since there are two sets of VTC parameters for the NAND2 gate and only one output, the worst-case values for the output levels V OH and V OL (which are modeled as being independent of the CNT count: details at the end of this section) are selected from the two sets of VTC parameters (i.e., so that SNM is the lowest, see [17, Sec. IX-B] for an example). We model the VTC parameters (V OH, V IH, V IL, and V OL) as affine functions of log 10 (n P /n N) (this model is shown for an inverter in Fig. 9), which achieves a root-mean-square (RMS) modeling error ≤2.5 mV in all cases (details in [17, Sec. IX-C]). For each case, this affine function is represented by a real-valued matrix $T \in \mathbb{R}^{4 \times 2}$

$$\begin{bmatrix} V_{OH} \\ V_{IH} \\ V_{IL} \\ V_{OL} \end{bmatrix} = T \begin{bmatrix} 1 \\ \log_{10}(n_P / n_N) \end{bmatrix}, \qquad T = \begin{bmatrix} T_{VOH0} & T_{VOH1} \\ T_{VIH0} & T_{VIH1} \\ T_{VIL0} & T_{VIL1} \\ T_{VOL0} & T_{VOL1} \end{bmatrix}. \tag{14}$$
To construct the full variation-aware SNM model (consisting of many instances of T in our standard cell library: one for each combination of input states that sensitizes each logic stage input), we perform two steps for each instance of T.
1) Sample the CNT count for each CNFET in the logic stage 2000 times [53] (using the distribution of CNT count, given the CNFET widths and the experimentally demonstrated processing parameter values in Table I), and use SPICE simulations to obtain V OUT versus V IN. For each sample, record n P and n N and extract V OH, V IH, V IL, and V OL from the VTC in each simulation.
2) Find T via linear regression, given the recorded n P and n N and the extracted V OH, V IH, V IL, and V OL.
We observed that in all cases, T VOH1 ≈ 0 and T VOL1 ≈ 0 (14), indicating that the CNT count ratio does not strongly affect the output levels of a logic stage. Thus, to simplify our model, we set T VOH1 = 0 and T VOL1 = 0, and maintain RMS modeling error ≤2.5 mV in all cases (details in [17, Sec. IX-C]). In Fig. 9, the VTC parameters are plotted versus log 10 (n P /n N).
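The regression step can be sketched with ordinary least squares on x = log10(n_P/n_N). The code below fits one VTC parameter at a time and returns an (intercept, slope) pair, which would correspond to one row of T; in place of SPICE-extracted values it uses synthetic data generated from a known affine relationship, and all names and numbers are illustrative assumptions.

```python
import math


def fit_affine(xs, ys):
    """Ordinary least-squares fit y ~ intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_vtc_parameter(n_p, n_n, values):
    """Fit one VTC parameter (e.g., V_IH) against log10(n_P / n_N)."""
    xs = [math.log10(p / n) for p, n in zip(n_p, n_n)]
    return fit_affine(xs, values)


if __name__ == "__main__":
    n_p = [4, 8, 16, 8]    # sampled p-type CNT counts (synthetic)
    n_n = [8, 8, 8, 16]    # sampled n-type CNT counts (synthetic)
    # Synthetic "extracted" V_IH values from a known affine relationship.
    v_ih = [0.25 + 0.05 * math.log10(p / n) for p, n in zip(n_p, n_n)]
    print(fit_vtc_parameter(n_p, n_n, v_ih))
```

Repeating the fit for V_OH, V_IH, V_IL, and V_OL yields the four rows of one instance of T; a near-zero fitted slope for the output levels would correspond to the observation that T VOH1 ≈ 0 and T VOL1 ≈ 0.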
This variation-aware SNM model is critical for efficiently computing PNMV due to CNT count variations, as it relates the VTC parameters to the CNFET CNT counts for each logic stage (14) . However, solving (5) for PNMV (in Section II-C) is not trivial due to CNT correlation (Section II-A), which causes correlated SNM among gate pairs. In Fig. 8 , for example, gate pairs (U1, U3) and (U3, U5) have correlated SNM since the CNFETs in U3 and U5 have correlated CNT counts (they overlap the same sampling regions).
2) Full-Circuit PNMV Model: Here, we demonstrate how the variation-aware SNM model is used to efficiently calculate PNMV ≤ 0.001% (which is desirable for VLSI-scale circuits: Section II-C) without using an MC-based technique (which would require many trials: e.g., >10^5 since 0.001% = 1/10^5). There are two key aspects in our framework for computing PNMV.
1) PNMV Formulation: We formulate PNMV as a function of the sampling region CNT count variables (which are independent) to account for the effects of CNT correlation (Section II-A) on SNM.
2) Solving the PNMV Formulation Efficiently: We provide a systematic technique to identify a small subset of all SNM constraints in the circuit [i.e., in (5) in Section II-C], referred to as the critical SNM constraints, which are the only SNM constraints that are required to compute PNMV. Due to CNT correlation, an SNM violation in a noncritical SNM constraint implies that there must also be an SNM violation in a critical SNM constraint; hence, the noncritical SNM constraint is not required to compute PNMV. For the OpenSPARC modules, <1% of all SNM constraints can be critical SNM constraints [17, Table VIII]. Hence, the time to compute PNMV (proportional to the number of SNM constraints) can improve by >100×.
The first step to compute PNMV is to convert the SNM constraints in (5) into constraints on the CNT counts of each CNFET [using (2), (3), and (5) in Section II-C]. For each gate pair (G (dr), G (ld)) [e.g., (U3, U4) in Fig. 8], each SNMH constraint has the form in (15) and each SNML constraint has the form in (16) (there can be multiple SNMH and SNML constraints for a single gate pair, details below)

$$V_{OH}^{(dr)} - V_{IH}^{(ld)} \ge \mathrm{SNM}_R \tag{15}$$

$$V_{IL}^{(ld)} - V_{OL}^{(dr)} \ge \mathrm{SNM}_R. \tag{16}$$
We then substitute the variation-aware SNM model (14) into these constraints, using T (dr) and T (ld) to represent the SNM model for G (dr) and G (ld) , respectively [e.g., T (dr) = T (INV_X1) and T (ld) = T (NAND2_X1-in1) for (U3, U4) in Fig. 8 ]
These constraints are equivalently expressed in matrix form
Note that the vector inequality in (19) is element-wise (as are all vector inequalities in this section). To account for all SNM constraints in the circuit, let c be the total number of SNM constraints, and let t be the total number of CNFETs (each with its own CNFET CNT count variable, e.g., n (P3,1) P for CNFET P3,1 in Fig. 8). For every gate pair (G (dr), G (ld)), there is an SNMH constraint and an SNML constraint for each combination of input states that sensitizes the input to G (ld) that is driven by G (dr). For example, if G (dr) drives an input of G (ld) that can be sensitized by three combinations of input states [e.g., input "A" of an "and-or-invert" logic stage with Boolean function: out = (A + (B * C))′], then there are three SNMH and three SNML constraints for that gate pair (which may constrain different CNT count variables).
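The count of sensitizing input combinations in the and-or-invert example can be checked by enumeration. The sketch below lists, for a chosen input, every assignment of the other inputs under which toggling that input changes the output; the helper name `sensitizing_combinations` is hypothetical.

```python
from itertools import product


def sensitizing_combinations(func, n_inputs, index):
    """Assignments of the other inputs for which input `index` is sensitized,
    i.e., toggling it changes the output of `func`."""
    combos = []
    for others in product((0, 1), repeat=n_inputs - 1):
        low = others[:index] + (0,) + others[index:]
        high = others[:index] + (1,) + others[index:]
        if func(*low) != func(*high):
            combos.append(others)
    return combos


def aoi(a, b, c):
    """And-or-invert stage from the text: out = (A + (B * C))'."""
    return int(not (a or (b and c)))


if __name__ == "__main__":
    # Input A is sensitized whenever B * C = 0: three combinations of (B, C).
    print(sensitizing_combinations(aoi, 3, 0))
    # Input B is sensitized only when A = 0 and C = 1.
    print(sensitizing_combinations(aoi, 3, 1))
```

Each sensitizing combination contributes one SNMH and one SNML constraint for the gate pair, so input A of this stage yields three of each.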
The total number of SNM constraints in the circuit is c, and each one imposes a constraint on the CNFET CNT count variables (e.g., n (P3,1) P for CNFET P3,1 in Fig. 8). We can represent these c constraints with a single matrix inequality, by first defining a column vector s ∈ R t that contains the CNT count variables for all the CNFETs in the circuit (e.g., if the entire circuit consisted of the ten CNFETs shown in Fig. 8, then t = 10 and s would have ten entries, one per CNFET). Then, by using all instances of T in the SNM model (14), we formulate each SNM constraint as a constraint on the vector s, using the same procedure as above to convert (15)-(16) into constraints on the CNT counts in (19) (example in [17, Sec. IX-D]). We express these constraints using a matrix H ∈ R c×t , such that satisfying (22) is equivalent to satisfying all SNM constraints in the circuit. [17, Table IX] summarizes all variables in this section.
$$Hs \preceq 0. \tag{22}$$
Note that 0 is a column vector with every entry equal to 0. See [17, Sec. IX-D] for the formulation of (22) for the example circuit shown in Fig. 8. Since all SNM constraints in the circuit are represented in (22) (each of the c rows in H represents a single SNM constraint), PNMV (5) is the probability that (22) is violated (i.e., PNMV = 1 − P{Hs ≼ 0}). However, solving for PNMV using (22) is not trivial since the CNT count variables (i.e., the elements of s) can be highly correlated due to CNT correlation (Section II-A). For example, in Fig. 8, the active regions of CNFETs P3,1 and P5,1 are aligned, so their CNT counts (n (P3,1) P and n (P5,1) P ) are correlated. Thus, the SNM constraints on n (P3,1) P and the SNM constraints on n (P5,1) P are dependent. We can reformulate PNMV to efficiently account for CNT correlation by transforming the constraints in (22) (that constrain the CNFET CNT count variables, which are dependent) into constraints on the sampling region CNT count variables (which are independent). To do so, we first define a column vector n ∈ R r that contains the CNT count variables for all the sampling regions (e.g., in Fig. 8, n = [n 1 ; n 2 ; n 3 ; n 4 ; n 5 ; n 6 ; n 7 ; n 8 ; n 9 ; n 10 ; n 11 ; n 12 ; . . . ]). To formulate (22) in terms of vector n instead of vector s (s: the CNFET CNT count variables), the relationship between n and s is required. We express this relationship as a linear transformation represented by a matrix B ∈ {0, 1} t×r (details below) such that
$$s = Bn. \tag{23}$$
There is one row in B for each CNFET in the circuit, and one column for each sampling region. To determine B: if CNFET i (of t total CNFETs) overlaps sampling region j (of r total sampling regions), then the value of B in row i, column j is 1 (i.e., B i,j = 1); otherwise, B i,j = 0 (e.g., for the circuit in Fig. 8). Substituting (23) into (22), the SNM constraints can be expressed in terms of the sampling region CNT count variables (instead of the CNFET CNT count variables), using a matrix K = HB ∈ R c×r :
$$Kn \preceq 0. \tag{25}$$
All SNM constraints in the circuit are represented in (25) [just as in (22)], so (25) can also be used to determine PNMV (i.e., PNMV = 1 − P{Kn ≼ 0}). The advantage of using (25) instead of (22) is that all the variables in n (the vector of sampling region CNT counts) are independent (unlike the correlated variables in s: the vector of CNFET CNT counts). For example, (25) can be used to estimate PNMV via an MC-based approach: for each trial, sample all elements of n (from the distribution of sampling region CNT count) and evaluate Kn. Then estimate PNMV as the fraction of trials that violate (25) .
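The MC-based estimate described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not our production flow: the matrices `H` and `B` below describe a made-up circuit with three CNFETs and two sampling regions, the sampling region CNT counts are drawn from a Gaussian with hypothetical `mu_r` and `sigma_r`, and the helper names (`matvec`, `matmul`, `estimate_pnmv`) are ours.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def estimate_pnmv(K, mu_r, sigma_r, trials, rng):
    """Fraction of trials in which at least one row of K n <= 0 is violated."""
    r = len(K[0])
    violations = 0
    for _ in range(trials):
        # Sampling-region CNT counts are independent (Gaussian model assumed here).
        n = [rng.gauss(mu_r, sigma_r) for _ in range(r)]
        if any(value > 0.0 for value in matvec(K, n)):
            violations += 1
    return violations / trials

# Hypothetical toy circuit: t = 3 CNFETs, r = 2 sampling regions, c = 2 constraints.
B = [[1, 0],   # CNFET 0 overlaps region 0
     [1, 0],   # CNFET 1 is aligned with CNFET 0 (same region, so correlated)
     [0, 1]]   # CNFET 2 overlaps region 1
H = [[1.0, 0.0, -2.0],   # made-up SNM constraint: s0 - 2*s2 <= 0
     [0.0, -1.0, 0.5]]   # made-up SNM constraint: -s1 + 0.5*s2 <= 0
K = matmul(H, B)         # the same constraints expressed on region counts

pnmv_estimate = estimate_pnmv(K, mu_r=10.0, sigma_r=2.0, trials=20000,
                              rng=random.Random(0))
```

Because each trial samples only the r independent region counts, the correlation between aligned CNFETs is captured by `B` rather than by a joint distribution over s.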
However, evaluating every SNM constraint in (25) is unnecessary since many of them are noncritical SNM constraints (as described above, any SNM constraint that cannot be uniquely violated without simultaneously violating another SNM constraint is not required to compute PNMV). See [17, Sec. IX-E] for a detailed description, including examples, of how to systematically identify and eliminate all noncritical SNM constraints.
Eliminating these noncritical SNM constraints is crucial to improve computational efficiency, as they can account for ≥99% of all SNM constraints in the circuit (e.g., for the OpenSPARC modules [17, Table VIII]). Since each row of K (25) represents an SNM constraint, we can remove the rows in K that correspond to noncritical SNM constraints to form K̃ ∈ R p×r , where p is the number of critical SNM constraints:
$$\tilde{K} n \preceq 0. \tag{26}$$
To further improve computational efficiency, we then factor out µ R and σ R from the sampling region CNT count variables n in (26) (just as we did for the full-circuit delay model in Section III-A), allowing us to quickly recompute PNMV after updating the processing parameter values (details in Section III-D). We rewrite n = µ R 1 + σ R x, where each entry of x ∈ R r is distributed according to a unit Gaussian distribution; then (26) becomes
$$\tilde{K} x \preceq \frac{\mu_R}{\sigma_R}\, b. \tag{27}$$
In (27), b is a p-dimensional vector of constants and the matrix-vector product K̃x is a p-dimensional vector of Gaussian random variables with covariance matrix
$$C = \tilde{K}\tilde{K}^{T}.$$
That is, K̃x is distributed according to a multivariate normal (MVN) distribution with covariance matrix C; thus, PNMV = 1 − P{K̃x ≼ (µ R /σ R )b} can be solved numerically using existing software packages for computing MVN probabilities (see [14]). In particular, consider the cumulative distribution function (CDF) of the MVN-distributed matrix-vector product K̃x (i.e., the MVNCDF); the probability that all SNM constraints are satisfied (i.e., 1 − PNMV) is equal to the value of the MVNCDF at the p-dimensional point (µ R /σ R )b:
$$\mathrm{PNMV} = 1 - \mathrm{MVNCDF}\!\left(\frac{\mu_R}{\sigma_R}\, b;\; C\right). \tag{31}$$
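As a brief check of (27), the substitution can be carried out explicitly. This is our own restatement, assuming that (26) has the homogeneous form K̃n ≼ 0 (with K̃ the matrix of critical SNM constraints), that σ R > 0, and that the entries of x are independent with zero mean and unit variance:

$$
\tilde{K}\left(\mu_R \mathbf{1} + \sigma_R x\right) \preceq 0
\;\Longleftrightarrow\;
\tilde{K} x \preceq \frac{\mu_R}{\sigma_R}\left(-\tilde{K}\mathbf{1}\right),
\qquad
\operatorname{Cov}(\tilde{K}x) = \tilde{K}\,\mathbb{E}\!\left[x x^{T}\right]\tilde{K}^{T} = \tilde{K}\tilde{K}^{T}.
$$

Under these assumptions b = −K̃1, which does not depend on µ R or σ R ; only the scalar µ R /σ R changes when those two quantities are updated.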
In [17, Sec. IX-F], we describe how to efficiently solve (31) by leveraging the property that many terms in the covariance matrix, C, are 0; e.g., for the OpenSPARC modules, PNMV can be computed in less than 10 seconds using a single 2.93 GHz processor core. In [17, Fig. 21 ], we validate the accuracy of (31) against MC simulations. For the MC approach, we first sample the vector of sampling region CNT counts [n in (25) ] for each trial. Then we estimate PNMV as the fraction of samples that violate (25) .
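To illustrate why zeros in C help, consider the extreme case in which C is diagonal, i.e., no two critical constraints share a sampling region. The MVNCDF then factors into a product of one-dimensional normal CDFs. The sketch below covers only this special case and is not the general method of [17, Sec. IX-F]; the matrix `K_crit`, the values of `mu_r` and `sigma_r`, and the function names are hypothetical, and b is taken as the negated row sums of the constraint matrix (what substituting n = µ R 1 + σ R x into a homogeneous constraint gives).

```python
from math import sqrt
from statistics import NormalDist

def gram(K):
    """Covariance C = K K^T of K x when x has independent unit-Gaussian entries."""
    return [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in K] for row_i in K]

def pnmv_independent_constraints(K, mu_r, sigma_r):
    """PNMV when C is diagonal, so the MVN CDF factors into 1-D normal CDFs."""
    C = gram(K)
    p = len(K)
    if any(C[i][j] != 0.0 for i in range(p) for j in range(p) if i != j):
        raise ValueError("constraints share sampling regions; a full MVN CDF is needed")
    phi = NormalDist()  # standard normal distribution
    prob_all_satisfied = 1.0
    for i, row in enumerate(K):
        b_i = -sum(row)                     # assumed b = -K 1 (see lead-in)
        threshold = (mu_r / sigma_r) * b_i  # i-th entry of (mu_R / sigma_R) b
        prob_all_satisfied *= phi.cdf(threshold / sqrt(C[i][i]))
    return 1.0 - prob_all_satisfied

# Hypothetical critical constraints that touch disjoint sampling regions.
K_crit = [[1.0, -2.0, 0.0, 0.0],
          [0.0, 0.0, -1.0, 0.5]]
pnmv_exact = pnmv_independent_constraints(K_crit, mu_r=10.0, sigma_r=2.0)
```

When C is block diagonal rather than diagonal, the same idea applies block by block: each block needs its own (smaller) MVNCDF evaluation, and the results multiply.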
C. Circuit Performance Sensitivity to Processing Parameters
Our goal is to achieve small delay penalties and PNMV with small ΔE. We quantify the tradeoff between total circuit energy [E Tot in (13)] and delay penalty using EDP 95 , defined in (32). We also define the energy-PNMV-product (ENP) metric to quantify the tradeoff between E Tot and PNMV:
$$\mathrm{ENP} = E_{Tot} \cdot \mathrm{PNMV}. \tag{33}$$
While rapid computation of circuit delay penalty and PNMV overcomes the computation time bottleneck of analyzing a single design point, we still require a method for intelligently exploring the large space of CNT processing options. In general, a common measure of the sensitivity of an objective function (e.g., EDP 95 or ENP) with respect to each of its input variables (e.g., the processing parameters) is its gradient. The EDP 95 and ENP gradients are defined in (34)-(36) and are used to guide the exploration of processing options to improve delay penalties and PNMV (Section III-D). Fig. 10 illustrates a flowchart of the steps used to compute (32)-(36).
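To make the role of the gradient concrete, the sketch below estimates the sensitivity of an objective to each processing parameter by central differences. It is a generic numerical illustration, not the definitions in (34)-(36): `toy_edp95` is a made-up smooth stand-in for EDP 95 , and the parameter names and values are placeholders.

```python
def finite_difference_gradient(objective, params, rel_step=1e-3):
    """Central-difference estimate of d(objective)/d(param) for each named parameter."""
    grad = {}
    for name, value in params.items():
        h = rel_step * abs(value) if value != 0 else rel_step
        up = dict(params, **{name: value + h})
        down = dict(params, **{name: value - h})
        grad[name] = (objective(up) - objective(down)) / (2.0 * h)
    return grad

def toy_edp95(p):
    """Made-up smooth stand-in for EDP95; NOT the model used in this paper."""
    return 1.0 + 2.0 * p["IDC"] ** 2 + 0.5 * p["p_m"] + 0.1 * p["p_Rs"]

point = {"IDC": 0.50, "p_m": 0.10, "p_Rs": 0.04}
grad = finite_difference_gradient(toy_edp95, point)
# The parameter with the largest-magnitude entry is the most effective one to improve.
most_sensitive = max(grad, key=lambda name: abs(grad[name]))
```

Each gradient entry costs two evaluations of the objective, which is why a fast single-design-point analysis is a prerequisite for this kind of guided search.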
D. Guided Exploration to Overcome CNT Variations
To overcome the bottleneck of trial-and-error-based search (i.e., iterating over many combinations of values for the processing parameters and design parameters defined in Section II: IDC, p m , p Rs , p Rm , k SelUpsize ), we use a gradient descent-based strategy to systematically guide the improvement of EDP 95 and ENP in the presence of CNT variations (while gradient descent strategies can converge to local rather than global optima, in [17, Sec. X-C] we discuss techniques to reduce the impact of local optima during gradient descent in our methodology).
For any design point, we can use single design point analysis (SDPA: Fig. 10) to determine the sensitivity of each circuit performance metric (e.g., EDP 95 or ENP) to each processing parameter by computing its gradient (e.g., ∇EDP 95 or ∇ENP). These gradients can then be used to identify which processing parameters should be improved (and by how much) to efficiently improve the circuit performance metrics. Before describing the full gradient descent methodology, we define the initial design point as the design point after EDP optimization in the nominal case (i.e., after steps 1-3 in Fig. 3). Also, we define the initial processing parameter values as the processing parameter values of the initial design point (e.g., IDC = 0.50, p m = 10%, p Rs = 4%, p Rm = 99.99%: Table I). Starting from the initial design point, we first perform selective upsizing (as described in Section II-A), incrementally increasing k SelUpsize (the number of standard cells to upsize) to generate a set of design points referred to as the initial energy-delay tradeoff curve (the values of k SelUpsize are chosen so that each increase in k SelUpsize increases ΔE by ∼1%-2% to identify multiple design points with ΔE ≤ 5%).
The full methodology, illustrated in Fig. 11, combines selective upsizing and gradient descent to overcome the impact of CNT count variations on delay penalty and PNMV. After generating the initial energy-delay tradeoff curve via selective upsizing, our goal is to identify multiple design points that meet both a delay penalty constraint (e.g., delay penalty ≤5%) and a PNMV constraint (e.g., PNMV ≤ 0.001%) with minimal energy cost (e.g., ΔE ≤ 5%). Such design points that simultaneously satisfy all these design goals are referred to as acceptable design points. Consequently, this is a feasibility problem in which we search for design points that meet two constraints, and we solve it using a variation of an alternating projections (AP) algorithm [3].
A typical AP algorithm iteratively projects a point onto multiple constraints until all are satisfied. In our methodology, we use gradient descent instead of projection; the full methodology (Fig. 11) is described below (example in Fig. 12 ).
1) Analyze the Initial Design Point: Perform SDPA (Fig. 10) on the initial design point [with initial processing parameter values and k SelUpsize set to minimize EDP NomOpt (1): i.e., after steps 1-3 in Fig. 3].
2) Gradient Descent: Alternate between: 1) performing gradient descent steps using ∇EDP 95 until the delay penalty constraint is satisfied and 2) performing gradient descent steps using ∇ENP until the PNMV constraint is satisfied. This procedure continues until either: a) both constraints are satisfied simultaneously (i.e., an acceptable design point is found) or b) ΔE is too large or the processing parameters are too constrained (e.g., a design point with ΔE > 5% is reached, or the required processing parameter values may be difficult to achieve experimentally: both are design choices).
3) Selective Upsizing: Reinitialize the processing parameters to their initial values (thus returning to the initial energy-delay tradeoff curve) and then perform selective upsizing (by increasing k SelUpsize ) to move to the next point on the initial energy-delay tradeoff curve. If ΔE from selective upsizing is too large (e.g., ΔE > 5%), then proceed to step 4 (below). Otherwise, loop back to step 2 (gradient descent) to search for an additional acceptable design point.
4) Design Point Selection and Validation: Select a single design point from all acceptable design points identified using gradient descent. For example, the designer can select the acceptable design point with the minimum EDP 95 or with the most relaxed processing requirements (a design choice). Finally, highly accurate models (e.g., the nonlinear timing model) can be used to validate the selected design point (if all constraints are not satisfied during validation, then perform additional gradient descent steps until they are satisfied).

Fig. 12. Gradient descent methodology (Fig. 11) to achieve delay penalty ≤5%, PNMV ≤ 0.001% (for SNM R = V DD /6), ΔE ≤ 5% (5 nm "pku" OpenSPARC module). Gradient descent paths descend from the initial energy-delay tradeoff curve (IDC = 0.50, p m = 10%, p Rs = 4%, p Rm = 99.99%). The point (delay penalty, ΔE) = (0%, 0%) represents the EDP-optimized nominal design point (Section II-A, Fig. 3). ΔE < 0 for point A since E Tot depends on the number of s-CNTs after m-CNT removal (i.e., the CNT count variables), as shown in (13) in Section III-A; due to CNT count variations (e.g., resulting from p m > 0%, p Rs > 0%), the total number of s-CNTs in all CNFETs can be reduced versus the nominal case (no variations).

Fig. 12 illustrates an example of the gradient descent-based methodology (Fig. 11) to meet delay penalty ≤5% and PNMV ≤ 0.001% with ΔE ≤ 5%. Starting from point A (the initial design point), we perform selective upsizing to generate the initial energy-delay tradeoff curve (as described earlier in this section) represented by points A-F. Then, using the methodology in Fig. 11, we perform gradient descent (starting from the initial design point: point A) until delay penalty ≤5% and PNMV ≤ 0.001% (at point G: an acceptable design point). Next, the processing parameters are reinitialized and then selective upsizing brings us to point B on the initial energy-delay tradeoff curve. Again, gradient descent is performed to identify another acceptable design point (point H). This process repeats until we reach point F on the initial energy-delay tradeoff curve, which has ΔE > 5%, concluding the search for acceptable design points.
In Fig. 12, gradient descent has identified multiple acceptable design points with varying ΔE and processing requirements. Furthermore, alternative sets of acceptable design points can be identified by adjusting the gradient descent step procedure: e.g., if IDC is difficult to improve (i.e., it is difficult to control CNT density variations experimentally), then the gradient descent step can be weighted toward larger updates in p m or p Rs , or can be forced never to update IDC past a predetermined hard limit. These constraints can be provided as inputs, and are features of this flexible framework.

Figure caption (beginning not recovered): design points with ΔE < 0 are for designs with smaller CNFET widths (see Section II-A, Fig. 3).
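The alternating structure of step 2 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation behind Fig. 11: the two metrics and their gradients are made-up closed-form functions standing in for the delay penalty, PNMV, ∇EDP 95 , and ∇ENP; the energy check and selective upsizing loop are omitted; and `floors` models the optional hard limits on how far a parameter may be improved.

```python
def alternating_descent(start, metrics, grads, limits, floors, step=0.05, max_steps=500):
    """Alternate gradient steps on whichever constraint is currently violated.

    Returns (params, found); found is True when both constraints are met.
    """
    p = dict(start)
    steps = 0
    while steps < max_steps:
        violated = [m for m in limits if metrics[m](p) > limits[m]]
        if not violated:
            return p, True  # acceptable design point
        target = violated[0]
        # Keep stepping on this metric until its own constraint is met.
        while metrics[target](p) > limits[target] and steps < max_steps:
            g = grads[target](p)
            # Floors act as hard limits that a parameter is never pushed past.
            p = {k: max(floors[k], p[k] - step * g[k]) for k in p}
            steps += 1
    return p, all(metrics[m](p) <= limits[m] for m in limits)


# Made-up smooth stand-ins for the two constrained metrics (NOT the paper's models).
metrics = {
    "delay_penalty": lambda p: 0.4 * p["IDC"] ** 2 + 0.5 * p["p_m"],
    "pnmv": lambda p: 0.2 * p["IDC"] ** 2 + 1.0 * p["p_m"] ** 2,
}
grads = {
    "delay_penalty": lambda p: {"IDC": 0.8 * p["IDC"], "p_m": 0.5},
    "pnmv": lambda p: {"IDC": 0.4 * p["IDC"], "p_m": 2.0 * p["p_m"]},
}
limits = {"delay_penalty": 0.05, "pnmv": 0.01}
floors = {"IDC": 0.0, "p_m": 0.0}

point, found = alternating_descent({"IDC": 0.50, "p_m": 0.10}, metrics, grads, limits, floors)
```

Raising a floor (e.g., `floors["IDC"] = 0.3`) mimics a processing parameter that is hard to improve; the search then either reaches the constraints through the remaining parameters or reports that no acceptable point was found.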
IV. RESULTS
We present two sets of results to demonstrate that we have overcome the bottlenecks of brute-force trial-and-error-based approaches. The first set of results (Section IV-A) demonstrates that we can analyze a set of design points >100× faster than before, while maintaining sufficient accuracy to make correct design decisions. The second set of results (Section IV-B) demonstrates the ability of this gradient descent algorithm to identify multiple processing options to meet design goals (e.g., delay penalty ≤5% and PNMV ≤ 0.001% with ΔE ≤ 5%) without exhaustive search. Using the results from gradient descent, we provide practical processing guidelines for each node and for multiple values of V DD such that, even in the presence of CNT variations, CNFET circuits can maintain ≥90% of the projected EDP benefits of nominal CNFET circuits.
A. Linearized Timing Model Validation
Here, we validate the speed and accuracy of the linearized timing model to analyze circuit delay variations. We first choose a set of design points that would typically be chosen by a designer seeking to optimize EDP 95 using a brute-force search-based approach: we use the design points in [52] as a reference. We analyze 112 design points including all combinations of: eight unique sets of processing parameter values (the same as in [52]: see [17, Table VII]), and 14 different k SelUpsize values (each increase in k SelUpsize increases ΔE by ∼1%-5%, e.g., in Fig. 13; higher resolution requires more computation time).
After choosing the set of design points, we use the nonlinear timing model to compute EDP 95 of every design point and then select the design point with the best (minimum) EDP 95 . We also record the total required computation time. We then perform the same procedure using the linearized timing model. We evaluate: 1) the total computation time for each model and 2) the degradation (increase) in EDP 95 due to using the linearized timing model. To quantify this degradation, we use the EDP 95 sub-optimality metric defined in (37). Ideally, the same design point is selected using each of the two models (resulting in EDP 95,sub-opt = 0%, example in Fig. 13). This is the case for five of the eight OpenSPARC modules (5 nm node), and the other three have EDP 95,sub-opt ≤ 2% (Table III). The linearized model achieves >100× speed-up in all cases.
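The comparison procedure can be sketched as follows. Since (37) is not reproduced here, the formula below is an assumed form consistent with the description above (the relative increase in nonlinear-model EDP 95 caused by selecting the design point with the linearized model); the design point labels and EDP 95 values are invented for illustration.

```python
def edp95_suboptimality(edp95_nonlinear, edp95_linearized):
    """Relative EDP95 increase (judged by the reference model) from selecting with the fast model."""
    best_ref = min(edp95_nonlinear, key=edp95_nonlinear.get)
    best_fast = min(edp95_linearized, key=edp95_linearized.get)
    ref_value = edp95_nonlinear[best_ref]
    sub_opt = (edp95_nonlinear[best_fast] - ref_value) / ref_value
    return sub_opt, best_ref, best_fast

# Invented EDP95 values (arbitrary units) for three hypothetical design points.
nonlinear = {"A": 1.000, "B": 0.950, "C": 0.960}
linearized = {"A": 1.020, "B": 0.970, "C": 0.965}

sub_opt, chosen_by_nonlinear, chosen_by_linearized = edp95_suboptimality(nonlinear, linearized)
```

In this toy case the two models disagree (B versus C), yet the penalty is about 1%, because the point picked by the fast model is nearly as good when judged by the reference model.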
B. CNT Processing and CNFET Circuit Design Guidelines
We now demonstrate the effectiveness of the gradient descent methodology to identify multiple sets of guidelines for processing parameters (i.e., processing routes) that meet design goals for all OpenSPARC modules simultaneously. For each OpenSPARC module, we first perform gradient descent (Fig. 11, with initial processing parameter values: IDC = 0.50, p m = 1%, p Rs = 4%, p Rm = 99.99%) to identify multiple acceptable design points (with delay penalty ≤5%, PNMV ≤ 0.001%, ΔE ≤ 5%), and then we select the design point with the most relaxed processing requirements (though other selection criteria can be used, e.g., lowest EDP 95 ). Then, for each processing parameter, we select its most constrained value (i.e., the value closest to its ideal value: Table I) over all the selected design points (one for each OpenSPARC module). These values form a processing route, and we then validate that design goals are met for all modules for this processing route (e.g., using the nonlinear model to compute delay penalty). Table IV provides processing routes for the OpenSPARC modules at the 14, 10, 7, and 5 nm nodes (highlighted entries in Table IV are limited by the PNMV constraint; other entries are limited by the delay penalty constraint). For each node, processing routes are shown for multiple delay penalty constraints to illustrate the tradeoff between delay penalty, PNMV, and processing requirements. All processing routes in Table IV meet count-limited yield ≥99.999%, resulting from minimum-width upsizing (step 2 in Fig. 3: to reach count-limited yield ≥99.9%) and CNT process improvements; if count-limited yield <99.999%, then we can return to step 2 in Fig. 3 to increase W MIN , then repeat gradient descent (Fig. 11) to find processing routes. We have so far targeted delay penalty ≤5%, PNMV ≤ 0.001%, and ΔE ≤ 5%, which maintains ≥90% of the projected EDP benefits of nominal CNFET circuits despite CNT variations.
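The construction of a processing route from per-module design points can be sketched as follows. The ideal values in `IDEAL`, the module names, and all numbers are illustrative assumptions (they are not copied from Table I or Table IV); percentages are written as plain numbers.

```python
# Assumed ideal values for illustration only: no CNT density variation (IDC = 0),
# no m-CNTs (p_m = 0%), no s-CNTs removed (p_Rs = 0%), all m-CNTs removed (p_Rm = 100%).
IDEAL = {"IDC": 0.0, "p_m": 0.0, "p_Rs": 0.0, "p_Rm": 100.0}

def processing_route(selected_points, ideal=IDEAL):
    """For each parameter, keep the value closest to ideal across all modules."""
    return {
        name: min((point[name] for point in selected_points.values()),
                  key=lambda value: abs(value - target))
        for name, target in ideal.items()
    }

# One selected design point per (hypothetical) module.
selected = {
    "module_a": {"IDC": 0.30, "p_m": 1.0, "p_Rs": 3.0, "p_Rm": 99.99},
    "module_b": {"IDC": 0.40, "p_m": 0.8, "p_Rs": 3.5, "p_Rm": 99.999},
}
route = processing_route(selected)
```

Because the route mixes the tightest requirement of each parameter across modules, it is at least as strict as every per-module point, but it still has to be validated against all modules as described above.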
However, achieving these design goals can impose processing requirements that may be difficult to achieve experimentally (e.g., IDC = 0.19 for delay penalty ≤5% at the 5 nm node: Table IV). In Table V, we provide alternative processing routes that maintain ≥90% of the projected EDP benefits of nominal CNFET circuits; we target design points with EDP benefit ≥90% (versus nominal) with a relaxed delay penalty constraint (≤10%, resulting in lower ΔE to meet the EDP benefit goal).
The amount by which each processing parameter is improved is a measure of its effectiveness to improve delay penalty and PNMV (gradient descent incurs larger updates for processing parameters that more significantly impact these performance metrics, details in [17, Sec. X-B]). We report the relative improvement (R) of each processing parameter (Table V). R is calculated using the percentage improvement (I) and the total improvement (I Tot ) of the processing parameters.
The relative improvement is highest for IDC for all nodes, showing that IDC is a highly effective parameter to improve for reducing delay penalties and PNMV in an energy-efficient manner. From our results, we make the following conclusions.
1) The computationally efficient linearized timing model runs over 100× faster than the nonlinear timing model, and maintains sufficient accuracy to identify design points with EDP 95,sub-opt ≤ 2% for all test cases. 2) PNMV ≤ 0.001% can be efficiently computed. 3) Gradient descent is a systematic and scalable method to meet both delay penalty and PNMV constraints. 4) Gradient descent can efficiently identify multiple processing routes to meet design goals.
5) In contrast to traditional thinking (which focuses on reducing p m to ultralow values), gradient descent identifies that reducing IDC is a highly effective means of meeting delay penalty and PNMV constraints, and that reducing p m past 1% suffers from diminishing returns. Unlike trial-and-error approaches [52] , gradient descent establishes these facts in a highly rigorous manner.
V. CONCLUSION
We have demonstrated a systematic methodology for joint exploration and optimization of CNT processing and CNFET circuit design to overcome the significant challenge of CNT variations. Our approach enables quick evaluation of delay variations and PNMV of CNFET VLSI circuits with >100× speed-up versus existing approaches. Our gradient descent-based framework accurately identifies the most important processing parameters, in conjunction with CNFET circuit sizing, to achieve high energy efficiency while satisfying circuit-level noise margin and yield constraints. Using this framework, an important question regarding CNT variations can be answered.
Question: What values of IDC, p m , p Rs , and p Rm should be targeted for highly scaled VLSI CNFET circuits to maintain a significant portion of their projected speed and energy efficiency benefits despite CNT variations, while also meeting circuit-level noise margin and yield constraints?
Answer: At the 5 nm node, we recommend IDC = 0.25, p m = 0.9%, p Rs = 2.5%, and p Rm = 99.99% to maintain ≥90% of the projected EDP benefits versus nominal CNFET circuits, with PNMV ≤ 0.001%, functional yield ≥99.999%, and ΔE ≤ 5%. These processing guidelines are attractive since p m = 1% and p Rm = 99.99% have been experimentally demonstrated, p Rs = 4% has been achieved, and promising work for continued improvement of p Rs has been shown [19]. This leaves IDC to be improved by 2× (versus IDC = 0.50: shown experimentally), thus identifying CNT density variations as an important topic of research. Additionally, processing requirements may be further relaxed by combining various CNT processing techniques (e.g., CNT sorting [1] followed by VMR [32]). Processing routes for other nodes are provided in Table V.
Unlike existing trial-and-error techniques, our framework can systematically explore the large space of CNT processing options, and generate a variety of processing routes depending on CNT processing technology constraints. Such systematic exploration is essential for a successful CNFET technology to avoid potential obstacles. Future research directions include the following.
1) Incorporation of CNT-metal contact resistance variations and threshold voltage variations into our framework, as well as other CNT processing techniques (e.g., [19]).
2) Experimental validation of model parameters for high-density CNT growth techniques and for channel lengths closer to the ballistic regime [41], [43].
3) Examination of the applicability of our framework for other emerging nanotechnologies, as many emerging nanotechnologies are expected to exhibit substantial variations. Our methodology can be adapted to overcome challenges in those technologies as well.

Prof. Mitra's research interests include robust systems, VLSI design, CAD, validation and test, emerging nanotechnologies, and emerging neuroscience applications. His X-Compact technique for test compression has been key to cost-effective manufacturing and high-quality testing of a vast majority of electronic systems, including numerous Intel products. X-Compact and its derivatives have been implemented in widely-used commercial Electronic Design Automation tools. His work on carbon nanotube imperfection-immune digital VLSI, jointly with his students and collaborators, resulted in the demonstration of the first carbon nanotube computer, and it was featured on the cover of NATURE. The NSF presented this work as a Research Highlight to the US Congress, and it also was highlighted as "an important, scientific breakthrough" by the BBC, Economist, EE Times, IEEE Spectrum, MIT Technology Review, National Public Radio, New York Times, Scientific American, Time, Wall Street Journal, Washington Post, and numerous other organizations worldwide.
VI. CNT VARIATIONS & CNT CORRELATION
CNTs are subject to the following CNT-specific variations: 1) CNT Type Variations: CNTs can be either metallic (m-CNT) or semiconducting (s-CNT) [50] . 2) CNT Density Variations: described in Section II. 3) CNT Diameter Variations: the diameter of a CNT is a function of its chirality, and can lead to changes in CNFET threshold voltage and on-current [34] . 4) CNT Alignment Variations: mis-positioned CNTs cause random alignment angles with respect to the CNT growth direction, resulting in variations in CNFET channel length [30] . 5) CNT Doping Variations: CNFETs require heavily doped source and drain extension regions to achieve small parasitic series resistance. Variations in the doping concentration lead to variation in series resistance [9] .
A. Improving p m : Diminishing Returns
As described in Section II-D, reducing p m past 1% suffers from diminishing returns and can be insufficient to meet design goals [16], [52]; Fig. 15 illustrates that p m = 0.1% does not achieve delay penalty ≤5% for the OpenSPARC modules at the 5 nm node.
B. Gaussian Approximation of CNT Count Distributions
To validate the Gaussian approximation to the CNT count distribution (as described in Section III-A), we sample the circuit delay cumulative distribution function (CDF) via MC SSTA (Section III-A) for each of two cases: using discrete (non-negative integer) CNT count variables [49] , and using the Gaussian approximation. For example, in the case of the 5 nm "gkt" OpenSPARC module, the Gaussian approximation underestimates the median delay (where the CDF is equal to 50%) by only 0.07%, and overestimates the delay spread (measured as the width between the points where the CDF is equal to 5% and 95%) by only 0.8% (Fig. 16) . We conclude that the Gaussian approximation is sufficient for our exploration purposes.
VII. CNFET DEVICE-& CIRCUIT-LEVEL MODELING
To efficiently evaluate delay penalties and PNMV for the OpenSPARC modules, we use variation-aware logic gate timing, energy, and SNM models that are built using a SPICE-compatible CNFET compact device model [23], which is based on the virtual source model [21]. This virtual source CNFET (VSCNFET) device model accounts for several nonidealities including (but not limited to) direct source-to-drain tunneling leakage current, parasitic gate-to-plug capacitance, fringing capacitance, source/drain extension region resistance, and CNT-metal contact resistance [45]. It has been calibrated with experimental data from 15 nm gate length CNFETs [23] and with data from NEGF-based (non-equilibrium Green's function) simulations of 5 nm gate length CNFETs [4].
We leverage the VSCNFET model to extract timing, energy, and noise margin information from SPICE simulations to build variation-aware logic gate timing, energy, and SNM models for each standard cell in our standard cell library, which is derived from the Nangate 15 nm Open Cell Library (OCL) [25] . We use the standard cell height, width, area, and maximum finger width from the Nangate 15 nm OCL for our 14 nm standard cell library, and then scale each of these dimensions by the ratio of the contacted gate pitch (L PITCH ) given by the 2013 edition of the International Technology Roadmap for Semiconductors (ITRS) [18] . L PITCH (as well as other device-and circuit-level parameters, including oxide thickness and wire resistivity) for the 14, 10, 7, and 5 nm technology nodes is taken according to the "Node Range" Labeling in the Process Integration, Devices, and Structures (PIDS) table for High Performance Logic Technology Requirements. All of our variation-aware models account for standard cell parasitics, including wire resistance to the source, drain, and gate of each CNFET, wire track resistance, and capacitance between the input, output, and supply rails (using experimentally measured values from [44] for the 14 nm node, which are then scaled for other nodes using parameters from the ITRS) [44] . See Table VI for results.
A. CNFET Device Parameter Optimization
Before building our variation-aware models for each standard cell, we first optimize the CNFET device parameters (i.e., parameters that define the geometry of the CNFETs and affect their electrical characteristics, details below) to target a high performance CNFET technology (as opposed to a low power CNFET technology; e.g., high performance versus low power options for standard cell libraries in silicon-CMOS circuits are often distinguished by the transistor threshold voltage and off-state leakage current) [18] . For example, these CNFET device parameters include the CNFET channel length, CNT-metal contact length, and the CNT diameter, all of which affect CNFET electrical characteristics (e.g., threshold voltage, parasitic capacitances, on-current, off-current, subthreshold slope, etc.). We choose to optimize the CNFET device parameters so as to minimize the EDP of an inverter with fan-out (FO) equal to four (i.e., the output load capacitance is four times as large as the input gate capacitance): a common metric for performance benchmarking and technology assessment [9] . We refer to this metric as EDP FO4 , where EDP FO4 = E FO4 T FO4 , T FO4 is the average of the rise delay (falling input/rising output) and the fall delay (rising input/falling output), and E FO4 is the average switching energy per transition. We perform the following CNFET device parameter optimization to minimize EDP FO4 (for each node and for each value of V DD ): L PITCH is held constant, and then L G (gate length), L C (CNT-metal contact length), L EXT (CNT extension length, which refers to the ungated region of the CNT between the gate and the source/drain contact, Fig. 1 [23] ), and V FB (flat-band voltage, which offsets the threshold voltage), are swept using the VSCNFET model to minimize EDP FO4 , subject to the constraint that the CNFET off-current I OFF ≤ 100 nA/µm in the nominal case (I OFF = 100 nA/µm is the target for high performance logic specified by the ITRS) [18] . 
Note that L PITCH = L C + L G + 2L EXT [23]. For each combination of CNFET device parameters, we simulate EDP FO4 using SPICE and the VSCNFET model and then select the CNFET device parameters that minimize EDP FO4 . CNFET device parameter optimization results (as well as additional parameters, including gate dielectric constant, contact height, CNT-CNT spacing, etc.) are provided in Table VI. In particular, we illustrate the optimized L C values (Table VI) in Fig. 18, demonstrating that EDP FO4 is highly sensitive to CNT-metal contact resistance (R C ).

Figure caption (beginning not recovered): AOI222_X1 standard cell [25] before (a) and after (b) active alignment (see Section II-A).

Fig. 15. Delay penalty improvement due to improving p m (5 nm node, OpenSPARC modules, after steps 1-3 in Fig. 3). IDC = 0.50, p Rs = 4%, p Rm = 99.99%. Improving p m from 10% to 0.1% improves count-limited yield from ≥99.98% to ≥99.995%. However, despite p m = 0.1%, delay penalty can be >10%; thus, improving p m alone can be insufficient to meet design goals.

The optimized CNFET device parameters are then used as inputs to the VSCNFET model for all SPICE simulations to analyze I ON variations (Fig. 2) and to build variation-aware logic gate timing, energy, and SNM models to evaluate circuit-level performance metrics (e.g., delay penalty, PNMV, and ΔE) for each node and for multiple values of V DD (V DD = 0.50 V to compare technology nodes and V DD is swept down to 0.35 V in 0.05 V increments at the 5 nm node to evaluate the impact of V DD scaling).
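The constrained sweep used for device parameter optimization can be sketched as follows. This is an illustration of the search structure only: `toy_simulate` is a made-up closed-form stand-in for the SPICE/VSCNFET simulation, the swept values and the 30 nm pitch are placeholders, and L EXT is derived from the pitch relation rather than swept independently.

```python
from itertools import product

def sweep_device_parameters(l_pitch, l_g_values, l_c_values, v_fb_values,
                            simulate, i_off_max=100.0):
    """Return (EDP_FO4, parameters) of the best feasible point, or None."""
    best = None
    for l_g, l_c, v_fb in product(l_g_values, l_c_values, v_fb_values):
        l_ext = (l_pitch - l_c - l_g) / 2.0  # enforces L_PITCH = L_C + L_G + 2 L_EXT
        if l_ext <= 0.0:
            continue  # geometry does not fit within the pitch
        edp_fo4, i_off = simulate(l_g, l_c, l_ext, v_fb)
        if i_off > i_off_max:  # nominal off-current limit, in nA/um
            continue
        if best is None or edp_fo4 < best[0]:
            best = (edp_fo4, {"L_G": l_g, "L_C": l_c, "L_EXT": l_ext, "V_FB": v_fb})
    return best

def toy_simulate(l_g, l_c, l_ext, v_fb):
    """Made-up closed forms standing in for SPICE + VSCNFET; not physical."""
    edp_fo4 = 10.0 / l_c + 0.2 * l_g + 0.5 * v_fb
    i_off = 2000.0 / (l_g * (1.0 + 10.0 * v_fb))
    return edp_fo4, i_off

best = sweep_device_parameters(30.0, [8.0, 10.0, 12.0], [8.0, 10.0, 12.0],
                               [0.0, 0.1, 0.2], toy_simulate)
```

In this toy model a longer contact always lowers EDP FO4 , so the sweep selects the largest L C that fits; the off-current limit is what prevents it from also choosing the smallest gate length at zero flat-band shift.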
B. Physical Circuit Design for Circuit-Level Analysis
For circuit-level analysis, the OpenSPARC modules are synthesized using Synopsys Design Compiler (targeting the nominal case), Capo [36] is used for placement, and FLUTE [8] is used to estimate wire lengths, with wire parasitics computed using parameters from the ITRS [18] . The full design and analysis flow is shown in Fig. 3 (Section II).
C. Selective Upsizing
We leverage the following selective transistor/logic gate upsizing algorithm (i.e., selective upsizing, inspired by [52] ) to minimize circuit EDP (Section II-A) and to reduce circuit delay penalty (Section II-B). We first sort all standard cells according to their fan-out (fan-out: the ratio of the output load capacitance to the minimum input capacitance on any input) in the nominal case. Next, we upsize the standard cell with the largest fan-out by incrementing its drive strength (e.g., INV_X1 becomes INV_X2) and then re-sort all of the standard cells according to their fan-out; a parameterized number k SelUpsize ≥ 0 of the standard cells are upsized sequentially in this manner (note that, each standard cell can potentially be upsized multiple times). If a standard cell cannot be upsized because it is at its maximum drive strength (e.g., INV_X64 is the strongest inverter in our library), then the standard cell with the largest fan-out that can be upsized (i.e., it is not at its maximum drive strength) is upsized instead.
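The selective upsizing loop above can be sketched as follows. This is a minimal illustration under stated assumptions: drive strengths are stored as integer indices (0 for X1, 1 for X2, and so on), and `fan_out` is a caller-supplied function that recomputes the fan-out of a cell from the current sizes. The netlist, capacitance model, and all names are hypothetical. Picking the largest fan-out among the cells that can still be upsized is equivalent to re-sorting after every step and skipping cells at maximum drive strength.

```python
def selective_upsize(cells, k_sel_upsize, fan_out, max_strength_index):
    """Upsize the largest-fan-out cell k_sel_upsize times, re-evaluating each time.

    cells maps a cell name to its drive-strength index (0 = X1, 1 = X2, ...).
    """
    for _ in range(k_sel_upsize):
        # Cells already at maximum drive strength are skipped.
        candidates = [c for c in cells if cells[c] < max_strength_index]
        if not candidates:
            break
        target = max(candidates, key=lambda c: fan_out(c, cells))
        cells[target] += 1  # e.g., INV_X1 becomes INV_X2
    return cells

# Hypothetical three-cell chain a -> b -> c, where c drives a fixed external load.
DRIVES = {"a": ["b"], "b": ["c"], "c": []}
EXTERNAL_LOAD = {"a": 0, "b": 0, "c": 16}

def toy_fan_out(cell, sizes):
    """Load capacitance divided by input capacitance (input cap doubles per size step)."""
    load = EXTERNAL_LOAD[cell] + sum(2 ** sizes[d] for d in DRIVES[cell])
    return load / 2 ** sizes[cell]

sized = selective_upsize({"a": 0, "b": 0, "c": 0}, 3, toy_fan_out, 6)
capped = selective_upsize({"a": 0, "b": 0, "c": 0}, 3, toy_fan_out, 1)
```

Note that upsizing a cell raises the load seen by its driver, which is why the fan-out is recomputed after every step.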
VIII. VARIATION-AWARE TIMING/ENERGY MODEL

A. Timing Model Generation
The VSCNFET SPICE model and CNFET device parameter optimization results (Table VI) are used to build a variation-aware logic gate timing model that computes CNFET logic gate delay and output slew as functions of the CNT count in each sampling region (see Fig. 8 in Section III-B for details on the CNT count variables).
To build this model, we perform over 2000 SPICE simulations for each input pin of each logic stage in our standard cell library, varying the input slew rate (t InSlew ), the load capacitance (C Load ), and the CNT counts (n 1 , n 2 , …) of each sampling region. Four values of C Load and six values of t InSlew are analyzed for each logic stage to emulate typical operating conditions in a digital system (C Load : fan-out = 2, 4, 6, and 8; t InSlew : 1-64 ps). For each of the 24 combinations of (C Load , t InSlew ), we sample 100 random values of CNT count from the CNT count distribution (using processing parameters defined in Table I ), for a total of 2400 simulations for each input pin of each logic stage. We then calibrate timing models f(t InSlew , C Load , n 1 , n 2 , …) to the delay (d) and to the output slew (t OutSlew ) values extracted from the SPICE simulations.
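The simulation plan for one input pin can be sketched as below. This is an illustration only: the four fan-out values come from the text, but the six input slew values are placeholders (the text gives only the 1-64 ps range), and `toy_sampler` is a hypothetical stand-in for drawing CNT counts from the CNT count distribution.

```python
from itertools import product
import random

FAN_OUTS = [2, 4, 6, 8]                 # C_Load expressed as fan-out (from the text)
INPUT_SLEWS_PS = [1, 4, 8, 16, 32, 64]  # placeholder values within the 1-64 ps range
SAMPLES_PER_CORNER = 100

def build_simulation_plan(sample_cnt_counts, seed=0):
    """One (fan_out, slew, cnt_counts) entry per SPICE run for a single input pin."""
    rng = random.Random(seed)
    plan = []
    for fan_out, slew in product(FAN_OUTS, INPUT_SLEWS_PS):
        for _ in range(SAMPLES_PER_CORNER):
            plan.append((fan_out, slew, sample_cnt_counts(rng)))
    return plan

def toy_sampler(rng):
    """Hypothetical CNT counts (n1, n2) for two sampling regions."""
    return (rng.randint(1, 20), rng.randint(1, 20))

plan = build_simulation_plan(toy_sampler)
```

The plan has 4 × 6 × 100 entries, matching the 2400 simulations per input pin stated above.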
B. Timing Model Linearization
Solving for d in (41) (using the nonlinear timing model in [53] ) is not trivial, and may involve a numerical method that requires significantly more computational effort than a model of the form d = CV/i [48] :
Additionally, t InSlew must be determined for each input pin of each logic stage and must propagate through the circuit (as it affects the delay of subsequent logic stages), which further increases the computation time. To obtain the linearized model in (42), we linearize (41) with the following procedure:
1) Perform static timing analysis with the nonlinear timing model to calculate T Nom (Section II). This also yields the nominal delays and input slew rates for each logic stage: d Nom and t InSlewNom .
2) Define a new parameter, i Drive , which is an affine function of i 1 and i 2 , and is therefore also an affine function of the region CNT counts:
3) Replace the denominator in (41) with the value of i Drive to create a first-order delay model of the form d = CV/i [as in (42)] that gives the same value of d as the nonlinear model in the nominal case.
We choose to linearize the timing model around the nominal case so that it is independent of the processing parameter values. This enables delay factorization (Section III-A) to further improve computational efficiency.
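The idea behind the linearization steps can be illustrated with a short sketch. It does not reproduce (41)-(43); it assumes a generic affine drive current in the region CNT counts and scales it so that d = CV/i equals the nominal delay from the nonlinear model at the nominal CNT counts. The function name, weights, and numeric values are all hypothetical.

```python
def make_linearized_delay(c_load, v_dd, d_nom, weights, offset, n_nom):
    """Return d(n) = C*V / i_drive(n), calibrated so that d(n_nom) equals d_nom."""
    def affine(n):
        # Affine function of the region CNT counts (assumed form).
        return offset + sum(w * x for w, x in zip(weights, n))

    # Scaling an affine function keeps it affine; the scale matches the nominal case.
    scale = (c_load * v_dd / d_nom) / affine(n_nom)

    def delay(n):
        return c_load * v_dd / (scale * affine(n))

    return delay

delay = make_linearized_delay(
    c_load=1e-15, v_dd=0.5, d_nom=5e-12,
    weights=[1e-6, 1e-6], offset=0.0, n_nom=[10, 10])
```

Because the calibration uses only nominal quantities, the resulting model does not depend on the processing parameter values, which is the property the text relies on for delay factorization.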
C. Timing Model Validation
We validate our timing models using circuit modules synthesized from the ISCAS-85 benchmarks [15] . For each circuit module, we compare the critical path delay (in the nominal case) computed using our timing model versus SPICE simulations (using the VSCNFET model), according to the following three-step procedure.
1) For the EDP-optimal nominal design point [defined in Section II-A: (E NomOpt , T NomOpt ) (1)], compute the nominal critical path delay using the timing model and record an arbitrary critical path.
2) Create a SPICE netlist of the cascaded standard cells on the recorded critical path (using the standard cell library described in Section VII) and instantiate capacitors to account for the capacitances of branches off the recorded critical path.
3) Perform transient analysis in SPICE and extract the delay (as the time taken for the output to reach V DD /2 from the time that the input reaches V DD /2); compare it to the critical path delay computed using the timing model in step 1.
Results for three of the ISCAS-85 benchmarks are shown in Fig. 19; the timing model overestimates the SPICE-extracted delay by an average of 3.8%, which we conclude is sufficient for our exploration purposes.
In Section IV-A, the linearized timing model is validated against the nonlinear timing model using the processing parameter values in Table VII (see Section IV-A for details).
IX. VARIATION-AWARE SNM MODEL & PNMV
A. Minimum-Width Upsizing to Improve PNMV
As shown in Fig. 7(c) in Section II-C, additional minimum-width upsizing (beyond the minimum-width upsizing for count-limited yield, i.e., in Fig. 3 in Section II-A) can improve PNMV through statistical averaging, at the cost of energy. In some cases, however, target PNMV constraints (e.g., PNMV ≤ 0.001%) cannot be satisfied even for arbitrarily high energy cost (e.g., for an arbitrarily large minimum width, W MIN , for minimum-width upsizing): once the minimum CNFET width exceeds the maximum width of a single CNFET in the standard cell library (limited by the standard cell height, see Table VI) [25], multiple CNFETs connected in parallel (i.e., multiple fingers) are required to increase the effective CNFET width. For highly aligned CNTs with aligned-active layouts, adding fingers to increase the effective CNFET width does not achieve the full benefits of statistical averaging because of CNT correlation. For example, a CNFET consisting of two CNFET fingers that have perfectly correlated CNT counts exhibits the same magnitude of CNT count variations as either of the CNFET fingers alone (measured by the standard deviation of the CNT count relative to its mean), despite having twice the effective width.
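The two-finger example can be made quantitative with a short worked derivation. This is a sketch under stated assumptions: the two fingers have CNT counts $n_a$ and $n_b$ with the same mean $\mu$ and standard deviation $\sigma$, and $\rho$ (a symbol introduced here for illustration) denotes their correlation coefficient.

```latex
\begin{align}
\mathrm{E}[n_a + n_b] &= 2\mu, \\
\mathrm{Var}[n_a + n_b] &= \sigma^2 + \sigma^2 + 2\rho\sigma^2 = 2\sigma^2(1+\rho), \\
\frac{\sqrt{\mathrm{Var}[n_a + n_b]}}{\mathrm{E}[n_a + n_b]} &= \frac{\sigma}{\mu}\sqrt{\frac{1+\rho}{2}}.
\end{align}
```

For perfectly correlated fingers ($\rho = 1$) the relative variation stays at $\sigma/\mu$, so the second finger gives no averaging benefit. For uncorrelated fingers ($\rho = 0$) it drops to $\sigma/(\mu\sqrt{2})$.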
B. Variation-Aware SNM Model Construction
As described in Section III-B1: for each logic stage input in our standard cell library, we model the VTC parameters for every case in which that input is sensitized (considering all possible combinations of the other inputs). The VTC parameters are functions of the CNT counts of the p- and n-type CNFETs which: 1) are gated by that input and 2) connect the logic stage output to either V DD or ground through a series of CNFETs in the "on" state. As an example, consider the pull-down network of a 3-input CMOS logic stage (inputs: inA, inB, inC) which consists of 4 CNFETs: A1, A2, B, and C. A1 and A2 are both gated by inA, B is gated by inB, and C is gated by inC. A1 and B are connected in series between the logic stage output and ground, and so are A2 and C (forming 2 parallel paths each with 2 CNFETs connected in series). Assume that we are obtaining the VTC of the logic stage for inA. Then when the state of (inB, inC) is (1, 0), A1 connects the logic stage output to ground through a series of CNFETs in the "on" state (since B is "on"), and A2 does not (since C is "off"). When the state of (inB, inC) is (1, 1) (i.e., another state that sensitizes inA), then both A1 and A2 connect the logic stage output to ground through a series of CNFETs in the "on" state (since both B and C are "on"), etc.
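The pull-down example above can be enumerated with a short sketch. It covers only the n-type network from the example (an n-type CNFET is treated as "on" when its gate input is 1), and the data structure and function name are hypothetical.

```python
# Each pull-down path is a series connection of (cnfet_name, gating_input) pairs.
PATHS = [
    [("A1", "inA"), ("B", "inB")],
    [("A2", "inA"), ("C", "inC")],
]

def conducting_cnfets(target_input, other_inputs):
    """CNFETs gated by target_input whose series partners are all in the "on" state."""
    active = []
    for path in PATHS:
        partners_on = all(other_inputs[gate] == 1
                          for _, gate in path if gate != target_input)
        if partners_on:
            active += [name for name, gate in path if gate == target_input]
    return active

case_10 = conducting_cnfets("inA", {"inB": 1, "inC": 0})
case_11 = conducting_cnfets("inA", {"inB": 1, "inC": 1})
case_01 = conducting_cnfets("inA", {"inB": 0, "inC": 1})
case_00 = conducting_cnfets("inA", {"inB": 0, "inC": 0})
```

An empty result, as for (inB, inC) = (0, 0), means that inA cannot pull the output down in that state, so no VTC is modeled for it.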
The VTC parameters are modeled separately for each input of a logic stage. For example, NAND2 gate U4 in Fig. 8 (Section III-B) consists of 4 CNFETs: 2 are gated by in1 (P4,1 and N4,1) and 2 are gated by in2 (P4,2 and N4,2). The VTC parameters for in1 are functions of n P (P4, 1) and n N (N4,1) (i.e., the CNT counts of CNFETs P4,1 and N4,1), and the VTC parameters for in2 are functions of n P (P4, 2) and n N (N4,2) (i.e., the CNT counts of CNFETs P4,2 and N4,2). Hence, two instances of T are required in the variation-aware SNM model (14) : T (NAND2_X1-in1) and T (NAND2_X1-in2) .
C. SNM Model Calibration
Section III-B1 describes how we build the variation-aware SNM model for each input of each logic stage in our standard cell library. The error of this model versus SPICE simulations is shown in Fig. 20; for each VTC parameter, the root-mean-square error [RMSE, defined in (44)
D. Rapid Analysis of Circuit PNMV
As described in Section III-B2, the SNM constraints on the CNFET CNT count variables are expressed using a matrix H (one row per SNM constraint, one column per CNFET CNT count variable), such that satisfying Hs ≼ 0 [(22) in Section III-B2] is equivalent to satisfying all SNM constraints in the circuit. As an example, (45) shows the constraints Hs ≼ 0 for the circuit in Fig. 8 [for gate pairs (U1, U3), (U3, U5), and (U3, U6)].⁴ Each row in H represents a single SNM constraint; e.g., the first row is an SNMH constraint for the gate pair (U1, U3) on the CNFET CNT count variables n P (P3,1) and n N (N3,1) : the first two elements in vector s. In (45), each element of H is subscripted by its row and column indices (e.g., H 1,2 for the 1st row, 2nd column of H). These terms are determined using equations similar to (20)-(21) along with all instances of T in the variation-aware SNM model (14). For example, H 1,2 and H 2,1 are associated with the gate pair (U1, U3), hence they are computed using T (D-latch,Out) for the output stage of D-latch U1 and T (INV_X1) for inverter U3.
Also described in Section III-B2, the relationship between the CNFET CNT count variables [s ∈ ℝ^t in (23), for t total CNFETs] and the sampling region CNT count variables [n ∈ ℝ^r in (23), for r total sampling regions] is expressed using a linear transformation s = Bn (23), where B ∈ {0,1}^(t×r) has B i,j = 1 if CNFET i overlaps sampling region j, and B i,j = 0 otherwise. For example, six rows of B for the circuit in Fig. 8 are shown in (46). The first row represents the transformation from the sampling region CNT counts to the CNFET CNT count n P (P3,1) ; since CNFET P3,1 overlaps sampling regions 1, 2, and 3, then B 1,1 = 1, B 1,2 = 1, B 1,3 = 1 (all other values of B in this row are 0), i.e., n P (P3,1) = n 1 + n 2 + n 3 .
By substituting (46) into (45) and computing K = HB (24) , all SNM constraints [on the sampling region CNT count variables, i.e., Kn ≼ 0 (25)] for the example circuit in Fig. 8 are expressed in (47) . Although (47) can be used to determine PNMV (e.g., using an MC-based approach, as described in Section III-B2), it contains many noncritical SNM constraints, which can be eliminated to more efficiently compute PNMV.
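The composition of the constraint matrices can be sketched on a toy example. The matrices below are hypothetical and much smaller than (45)-(47); only the first row of B mirrors the text (a CNFET overlapping sampling regions 1, 2, and 3). The sketch builds K = HB, checks that Kn equals Hs with s = Bn, and estimates the fraction of violating samples in the style of an MC-based PNMV estimate.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def violates(K, n):
    """True if any SNM constraint in Kn <= 0 (elementwise) is violated."""
    return any(value > 0 for value in matvec(K, n))

def estimate_pnmv(K, samples):
    """Fraction of sampled region CNT count vectors that violate a constraint."""
    return sum(violates(K, n) for n in samples) / len(samples)

B = [[1, 1, 1, 0],   # CNFET 1 overlaps sampling regions 1, 2, 3
     [0, 1, 1, 1]]   # CNFET 2 overlaps sampling regions 2, 3, 4 (hypothetical)
H = [[1.0, -2.0],    # hypothetical SNM constraint coefficients
     [-3.0, 1.0]]

K = matmul(H, B)
n = [4, 5, 6, 3]
s = matvec(B, n)
pnmv = estimate_pnmv(K, [[4, 5, 6, 3], [10, 0, 0, 0]])
```

Working with K directly avoids recomputing s for every sample, since the constraints are then expressed on the sampling region CNT counts.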
4 Note that, a single logic gate can be a driving logic gate in multiple gate pairs. In this case, U3 drives both U5 and U6, so it is a driving logic gate in gate pairs (U3, U5), and (U3, U6); there are SNMH and SNML constraints for each of these gate pairs.
E. Eliminating Noncritical SNM Constraints for PNMV
As described in Section III-B2, identifying and eliminating noncritical SNM constraints is crucial to efficiently determine PNMV; Table VIII illustrates that they can account for ≥99% of the total number of SNM constraints. Here, we describe how to systematically identify and eliminate noncritical SNM constraints. As an example, consider gate pairs (U1, U3) and (U3, U5) in Fig. 8 with the following SNMH constraints (15).
We now describe a case where SNMH(U3, U5) is always larger than SNMH(U1, U3), meaning that (49) 
These V IH terms are equivalent, since n P (P3,1) = n P (P5,1) and n N (N3,1) = n N (N5,1) (as the CNFETs in inverters U3 and U5 overlap the exact same sampling regions: Fig. 8 
, then the constraint in row i is noncritical, since it cannot be violated without simultaneously violating the constraint in row j. This property holds for s ≻ 0 (i.e., the vector of CNFET CNT counts is strictly greater than 0, meaning that there is no CNT count failure; CNT count failure and count-limited yield are accounted for separately using minimum-width upsizing for count-limited yield: Section II-A).
3) Noncritical SNM Constraint Elimination: remove all rows of K that correspond to the noncritical SNM constraints identified in step 2. As shown in Table VIII, the number of critical SNM constraints can be ≤1% of the total number of SNM constraints for the OpenSPARC modules (e.g., for "fgu").
To validate that (26) can be used instead of (25) to compute PNMV (i.e., that PNMV is unchanged by eliminating all noncritical SNM constraints): we first sample n two million times (the number of samples required to estimate PNMV ≤ 0.001% for the OpenSPARC modules: details in Fig. 21). Then, we evaluate both Kn and K̃n (where K̃ denotes the reduced constraint matrix in (26)) for each sample and verify that (25) is violated if and only if (26) is violated.
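The elimination and its validation can be sketched as follows. The exact test used in steps 1 and 2 is not reproduced here; this sketch assumes an elementwise dominance test on the rows of K. If every coefficient in row i is no larger than the matching coefficient in row j, then for nonnegative CNT counts row i cannot be violated unless row j is also violated. The matrix and function names are hypothetical.

```python
import random

def eliminate_noncritical(K):
    """Keep only rows of K that are not elementwise-dominated by another row."""
    def dominated(i):
        for j, other in enumerate(K):
            if j == i:
                continue
            if all(a <= b for a, b in zip(K[i], other)):
                # For identical rows, keep only the one with the lowest index.
                if K[i] != other or j < i:
                    return True
        return False
    return [row for i, row in enumerate(K) if not dominated(i)]

def violated(K, n):
    """True if any constraint in Kn <= 0 (elementwise) is violated."""
    return any(sum(k * x for k, x in zip(row, n)) > 0 for row in K)

K = [[1, -1, -1, -2],
     [0, -2, -1, -3],   # dominated by row 0
     [-3, -2, -2, 1],
     [-3, -2, -2, 1]]   # duplicate of row 2
K_reduced = eliminate_noncritical(K)

# Validation in the spirit of the text: the full and reduced constraint sets
# must agree on every sample of strictly positive CNT counts.
rng = random.Random(0)
samples = [[rng.randint(1, 20) for _ in range(4)] for _ in range(10000)]
agree = all(violated(K, n) == violated(K_reduced, n) for n in samples)
```

The pairwise comparison is quadratic in the number of constraints, so a production flow would likely restrict comparisons to constraints that share CNT count variables.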
F. Efficiently Solving the MVNCDF Formulation for PNMV
We make one final adjustment to (31) (in Section III-B2) to further improve the computational efficiency of solving for PNMV. In particular, the rows of C [in (29) in Section III-B2] can be permuted so that C is a block diagonal matrix (i.e., a matrix that has main diagonal blocks that are square matrices and off-diagonal blocks that are zero matrices); for example, starting with the matrix K in (47) (in Section IX-D), take the case in which rows 1 and 2 are redundant constraints (e.g., H 1,2 < H 3,4 and H 2,1 < H 4,3 ) and remove them from K to form the reduced matrix K̃. Then the covariance matrix C = K̃K̃ᵀ is a block diagonal matrix with diagonal blocks C u (52).
We now justify why C can be block diagonal. Let there be y rows of standard cells (e.g., row 1 and row 2 of standard cells are shown in Fig. 8 in Section III-B); each SNM constraint is associated with a single gate pair: (G (dr) , G (ld) ), where G (dr) is placed in row u of standard cells and G (ld) is placed in row v.
Then the SNM constraints [represented by the rows of the reduced matrix K̃ in (26) in Section III-B2] can be partitioned according to the row of standard cells in which G (ld) is placed, such that the SNM constraints in each partition are independent of the SNM constraints in each other partition. This property holds since:
1) V OH (dr) and V OL (dr) (for G (dr) ) are independent of the CNT count in the variation-aware SNM model [i.e., T VOH1 = 0 and T VOL1 = 0 in (14): Section III-B1]. Thus, the SNM constraints (26) only bound the CNT counts of the CNFETs in the loading logic gate (the driving logic gate affects the tightness of these constraints).
2) The sampling region CNT count variables are partitioned according to the boundaries between the rows of standard cells (e.g., n 1 , … n 6 belong to row 1 of standard cells in Fig. 8 and n 7 , … n 12 belong to row 2).
The consequence of these properties is as follows. Let (G i (dr) , G i (ld) ) and (G j (dr) , G j (ld) ) be the gate pairs constrained by the ith and jth SNM constraints, respectively [i.e., the ith and jth rows of K̃ in (26) in Section III-B2]. Then the covariance term C i,j is 0 if G i (ld) and G j (ld) are placed in different rows of standard cells. Thus, C = K̃K̃ᵀ can be block diagonal, where the size of each block u is equal to the number of SNM constraints in which the loading logic gate is placed in row u of standard cells. For example, in (52), C 1 ∈ ℝ^(2×2) since there are two SNM constraints that constrain the sampling region CNT counts in row 1 of standard cells.
[Figure caption, truncated] … (31), shown for the 5 nm "dec" OpenSPARC module. The coefficient of variation is the standard deviation of the sampling region CNT count relative to the mean: it depends on the processing parameter values, and we sweep it by changing IDC with pm = 1%, pRs = 4%, pRm = 99.99%. 2×10⁶ MC trials are required to reject the hypothesis that PNMV > 0.001% with 90% statistical power. The root-mean-squared percentage error is ≤10% (for PNMV values computed with statistical power ≥90%).
In our analysis, the size of each C u is on the order of ℝ^(10×10), corresponding to ~10 gate pairs per row of standard cells (after eliminating noncritical SNM constraints, i.e., following steps 1-3 in Section IX-E). As mentioned in Section III-B2, PNMV in (53) can be computed efficiently, e.g., in less than 10 seconds for the OpenSPARC modules (using a single 2.93 GHz processor core). Table IX summarizes the variables used to compute PNMV. The accuracy of the MVNCDF formulation [both (31) and (53)] is validated in Fig. 21.
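The block diagonal structure can be checked on a toy example. This sketch assumes the covariance is formed as the product of the reduced constraint matrix with its transpose, as in the text, and that each constraint only touches sampling regions in the standard cell row of its loading gate. The matrix, the row assignment, and the function names are hypothetical.

```python
def covariance(K_reduced):
    """C = K K^T: entry (i, j) is the dot product of constraint rows i and j."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in K_reduced]
            for ri in K_reduced]

def group_by_load_row(load_rows):
    """Permutation that places constraints with the same loading-gate row together."""
    return sorted(range(len(load_rows)), key=lambda i: load_rows[i])

# Sampling regions 0-1 lie in standard cell row 1, regions 2-3 in row 2.
K_reduced = [[1, -2, 0, 0],    # loading gate in row 1
             [0, 0, 3, -1],    # loading gate in row 2
             [-1, 1, 0, 0],    # loading gate in row 1
             [0, 0, -2, 2]]    # loading gate in row 2
load_rows = [1, 2, 1, 2]

order = group_by_load_row(load_rows)
C = covariance([K_reduced[i] for i in order])
```

For jointly Gaussian variables, zero covariance between blocks implies independence, so the probability that all constraints hold factorizes into a product over the blocks. Each factor is then a much smaller multivariate normal CDF.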
X. GRADIENT DESCENT IMPLEMENTATION
A. Gradient Calculation
A critical path is any path between a circuit input and a circuit output with propagation delay equal to the maximum path delay. There can be multiple critical paths for a single MC trial (full circuit delay model in Section III-A). Immediately after STA of each MC trial (Fig. 10), we numerically estimate ∇T 95 (34) via the following procedure:
1) Record an arbitrary critical path for each MC trial. These paths are used to estimate ∇T 95 using a subgradient, borrowing from the sub-gradient method for minimization of non-differentiable functions [38] (e.g., "max" in STA).
2) Decrease p m by an incremental amount (i.e., by δp m = 10⁻⁶; 10⁻⁶ is <0.1% of all the experimentally demonstrated processing parameter values in Table I), and then recompute the path delay only for the arbitrarily chosen critical path of each MC trial. Build the CDF of these path delays, and extract the delay where the CDF is equal to 95%: this extracted delay value differs from T 95 by an amount δT 95 (pm) . Do the same for IDC and for p Rs (with δIDC = 10⁻⁶ to compute δT 95 (IDC) and δp Rs = 10⁻⁶ to compute δT 95 (pRs) ).
3) Numerically estimate each element of ∇T 95 using (54).
This strategy assumes that for each MC trial, the chosen critical path remains a critical path after updating each processing parameter, which in general is only true in the limit as δIDC → 0, δp m → 0, δp Rs → 0, and is an approximation for δIDC = 10⁻⁶, δp m = 10⁻⁶, δp Rs = 10⁻⁶.
We use a similar methodology to estimate ∇PNMV (34): for each processing parameter, decrease that processing parameter by an incremental amount and recompute PNMV (to calculate δPNMV (IDC) , δPNMV (pm) , and δPNMV (pRs) ). The elements of ∇PNMV are estimated using similar equations as (54) . We use the same numerical approach to compute ∇E Tot (34) [using E Tot as defined in (13)].
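The numerical gradient estimate can be sketched as a backward finite difference. Since (54) is not reproduced here, the difference quotient below is an assumed standard form. `path_delay` is a hypothetical callback that returns the delay of the critical path recorded for one MC trial under the given processing parameters, and `toy_path_delay` is a made-up linear model used only to exercise the code.

```python
def percentile_95(values):
    """Smallest sample at which the empirical CDF reaches 95%."""
    ordered = sorted(values)
    index = (95 * len(ordered) + 99) // 100 - 1  # integer ceil(0.95 * N) - 1
    return ordered[index]

def estimate_gradient(params, path_delay, trials, delta=1e-6):
    """Backward-difference estimate of dT95/dp for each processing parameter."""
    t95 = percentile_95([path_delay(t, params) for t in trials])
    gradient = {}
    for name in params:
        shifted = dict(params)
        shifted[name] -= delta  # decrease one parameter by a small amount
        # Only the recorded critical path of each trial is re-evaluated (no full STA).
        t95_shifted = percentile_95([path_delay(t, shifted) for t in trials])
        gradient[name] = (t95 - t95_shifted) / delta
    return gradient

def toy_path_delay(base_delay, p):
    """Hypothetical path delay that grows linearly with each processing parameter."""
    return base_delay * (1 + 2.0 * p["IDC"] + 3.0 * p["pm"] + 0.5 * p["pRs"])

trials = list(range(1, 101))  # each trial is represented by its base path delay
gradient = estimate_gradient({"IDC": 0.5, "pm": 0.01, "pRs": 0.04},
                             toy_path_delay, trials)
```

Re-evaluating a single path per trial is what keeps this estimate cheap. The trade-off is the approximation noted above: the recorded path may stop being critical once a parameter changes.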
B. Gradient Descent Step
To update the processing parameters during gradient descent, we first normalize the gradient vector [e.g., ∇EDP 95 (35) or ∇ENP (36)] by its ℓ1-norm (i.e., the sum of the absolute values of its elements). This normalizes the magnitude of the step size. We then take a step so that the improvement in each processing parameter is proportional to its corresponding magnitude in the normalized gradient, and so that the total improvement in processing parameters sums to 10%. Small step sizes require more simulations; large step sizes yield coarse granularity in exploration. This strategy (though others may be used) assumes that it is equally difficult to improve each processing parameter by a fixed percentage. For example, if the elements of the normalized gradient have magnitudes 0.70, 0.10, and 0.20, then we reduce IDC, p m , and p Rs by 7%, 1%, and 2% (versus their current values), respectively.
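The update rule can be sketched in a few lines. This is a minimal illustration that assumes an improvement always means reducing a parameter relative to its current value, as in the example above. The function name and the starting parameter values are hypothetical.

```python
def gradient_step(params, gradient, total_improvement=0.10):
    """Reduce each parameter in proportion to its share of the l1-normalized gradient."""
    l1_norm = sum(abs(g) for g in gradient.values())
    if l1_norm == 0:
        return dict(params)  # zero gradient: nothing to update
    updated = {}
    for name, value in params.items():
        share = abs(gradient[name]) / l1_norm
        updated[name] = value * (1.0 - total_improvement * share)
    return updated

params = {"IDC": 0.50, "pm": 0.01, "pRs": 0.04}
gradient = {"IDC": 7.0, "pm": 1.0, "pRs": 2.0}  # normalizes to 0.70, 0.10, 0.20
updated = gradient_step(params, gradient)
```

With these inputs, IDC, pm, and pRs are reduced by 7%, 1%, and 2% of their current values, which reproduces the example in the text.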
C. Avoiding Convergence to Local Optima
Given that our optimization methodology is based on gradient descent, we employ two strategies to avoid convergence to local optima (since the objective is not necessarily a convex function):
1) Initialize Gradient Descent from Multiple Design Points on the Initial Energy-Delay Tradeoff Curve: each instance of gradient descent typically leads to a unique design point. Even if all instances of gradient descent converge to the same local optimum, we never choose a worse design point by starting another instance of gradient descent.
2) Never Increment a Processing Parameter Away from its Ideal Value (Table I): The only case in which all CNT count variations are zero, by definition, is the nominal case (in which all processing parameters have their ideal values). This is consequently the global optimum in terms of minimizing the effect of variations. Any case in which the gradient vector points toward incrementing the value of a processing parameter away from its ideal value is indicative of local optima; we choose to not update that parameter in these cases.
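Both strategies can be sketched briefly. This is an illustration under stated assumptions: the ideal values used below (zero for both parameters) are placeholders rather than the entries of Table I, and `descend` and `objective` are hypothetical callbacks standing in for one full gradient descent run and its figure of merit.

```python
def mask_gradient(params, gradient, ideal):
    """Zero the gradient entries whose descent step would move away from the ideal value."""
    masked = {}
    for name, g in gradient.items():
        step_direction = -g                      # descent moves against the gradient
        toward_ideal = ideal[name] - params[name]
        moves_away = (step_direction * toward_ideal < 0
                      or (toward_ideal == 0 and g != 0))
        masked[name] = 0.0 if moves_away else g  # masked parameters are not updated
    return masked

def best_of_multistart(start_points, descend, objective):
    """Run descent from every start point and keep the best resulting design point."""
    results = [descend(point) for point in start_points]
    return min(results, key=objective)

masked = mask_gradient(
    params={"IDC": 0.5, "pm": 0.01},
    gradient={"IDC": 2.0, "pm": -1.0},
    ideal={"IDC": 0.0, "pm": 0.0})

best = best_of_multistart([3.0, -2.0], descend=lambda x: x / 2, objective=abs)
```

Taking the minimum over all runs reflects the observation in strategy 1: an extra start point can only match or improve the selected design point.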
