Abstract: Technology computer-aided design (TCAD) device and small circuit simulations use numerical and physics models to investigate the properties and performance of circuits before they undergo fabrication. Thus, these simulations play a significant role in VLSI design, optimization, and verification. However, they suffer from poor convergence and high CPU times, especially when performing TCAD mixed-mode simulations. In this paper, we propose a new simulation flow to address this challenge. We use device states captured from single devices to build a device state library. Then, we leverage device-level solutions to form a global initial guess for circuit-level simulations that are based on the full Newton algorithm. This leads to a significant efficiency enhancement. The average speedup for quasi-stationary (or dc) operating point establishment for a standard cell library and a mirror adder is 6.9× and 21.2×, respectively, whereas the CPU time for static random access memory static noise margin extraction is reduced by 47.0%.
I. INTRODUCTION
Technology computer-aided design (TCAD) simulations play an extremely important role in device and small circuit design, extraction of compact models, technology rule development, and verification of manufacturability. They start from the physics configurations of devices and circuits, taking into account the carrier transport models and related electrical properties, and bridge the gap between a broad range of physics and electrical behavior models.
TCAD simulations can be classified into two types based on the intended application: device and mixed-mode. Device simulations are typically used to study the performance of electronic devices, whereas mixed-mode simulations target small circuits. A TCAD mixed-mode simulator provides a link between technology parameters and circuit performance. It is, therefore, particularly useful in investigating the impact of variations in technology and device designs on circuit properties.
TCAD mixed-mode simulations provide superior accuracy to compact models at the expense of higher CPU time, though both can be used to study the performance of small circuits [1]. The basis of TCAD simulations is a set of carrier transport equations. Another advantage of TCAD mixed-mode simulation is its ability to investigate circuits composed of new devices. With conventional metal-oxide-semiconductor field-effect transistors running out of steam, many new devices, e.g., FinFETs, trigate FETs, and tunnel FETs [2]-[4], have emerged. This requires dealing with new physics effects, e.g., quantum tunneling current, which can only be handled by physics equation-based TCAD simulations. Furthermore, in the early phases of technology development, reliance on conventional compact models is not feasible since such models often require inputs from accurate device simulation results, which are usually provided by TCAD simulations. Therefore, although both TCAD mixed-mode simulations and compact models have been widely used to characterize small circuits, in this paper, we mostly concentrate on TCAD mixed-mode simulations.
High CPU time is the bane of TCAD simulations, especially for mixed-mode simulations. Even a single static random access memory (SRAM) cell static noise margin (SNM) extraction simulation can take several CPU hours to converge. Usually, the Newton iteration method is used to solve the set of transport equations involved in TCAD simulations. Since the convergence time of the Newton method is proportional to the number of iterations and equations, which is linear in the number of meshed grid nodes, TCAD mixed-mode simulations for larger circuits can consume extremely high CPU times [5]. Consequently, use of accurate physics model-based TCAD mixed-mode simulations for these circuits is impeded by a computational barrier, known as the TCAD barrier.
To overcome the above-mentioned problem, previous work has focused on solution-reuse strategies. A well-known approach for speeding up device simulation is to reuse the device states from a solved simulation as an initial guess to a new simulation of the same physical structure [6]. Such a feature is provided in many widely used device simulation tools, e.g., Sentaurus TCAD from Synopsys [7]. A recent work [8] extends this feature to device simulations under different transport equations and physics models. For example, quantum-hydrodynamic device simulation is sped up with strategic use of less powerful hydrodynamic device simulation. This significantly improves the convergence of the more advanced transport models, which is a serious concern at the current technology nodes, while providing the added benefit of faster device simulation. However, this reuse strategy is typically limited to device simulation, and not circuit simulation, since capture and reuse of states from a circuit is much more difficult than doing so from an individual device for the following reasons.
1) The reuse strategy assumes that the physical structure of the entity being simulated does not change. This is very limiting in the case of circuits where we may want to simulate different circuit topologies or standard cells of different sizes.
2) Mixed-mode simulations of the nominal circuit may be too slow or fail to converge when more advanced, and hence more precise, physics models are adopted.
In this paper, we address these shortcomings of prior work by using a combination of device-level solutions to form an initial guess for circuit-level mixed-mode simulations. Furthermore, using solutions captured from single-device simulations, we show how to develop a device state library to efficiently generate an appropriate initial guess for the mixed-mode simulation. This leads to a dramatic simulation speedup. We report results of using our methodology to characterize various standard cells, SRAM arrays, and a mirror adder. Though we present results based on the 22-nm FinFET technology, the proposed simulation flow is a very general one and can be applied to other technologies as well.
The rest of this paper is organized as follows. In Section II, we provide the motivation for improving the efficiency of TCAD mixed-mode simulations. In Section III, we review related work. In Section IV, we present background material on TCAD simulations and various Newton iteration solvers. In Section V, we discuss the proposed methodology in detail. In Section VI, we present simulation results. Finally, we draw conclusions in Section VII.
II. MOTIVATION
In this section, we provide the motivation for reducing CPU times of TCAD mixed-mode simulations with initial-guess strategies.
A precise simulation before actual fabrication is necessary to help engineers explore the physics behind the circuits being simulated. In order to achieve higher simulation accuracy, more sophisticated physics models, complex transport equations, and lower error tolerance need to be targeted. This, unfortunately, significantly increases the CPU time for simulation.
A conventional approach for reducing the CPU times is to adopt a more time-efficient solver for the set of partial differential equations (PDEs) involved in TCAD mixed-mode simulations. Previous work has shown that for a set of PDEs captured in matrix form as

$$A x = b$$

the time cost of the direct solver is high, whereas the time cost of iterative solvers, e.g., the full Newton algorithm, can be reduced to $O(PK)$, where $P$ is the number of iterations and $K$ is the dimension of matrix $A$, which may be extraordinarily large ($K$ is proportional to the number of meshed grid nodes in the device or circuit being simulated). Thus, since $P$ is typically much smaller than $K$, iterative solvers are preferred. The convergence time of an iterative solver strongly depends on its initial guess [5]. Previous work has suggested efficiently acquiring an appropriate initial guess with solution-reuse strategies [6]. While such an approach works in the case of a single device, it has not been used to simulate standard cells or small circuits because of convergence problems. More importantly, cell or circuit-level solution reuse is very difficult in TCAD simulations, since the initial guess for such simulations consists of electrical parameter values from thousands of meshed grid nodes. In contrast to compact models that rely on circuit topology as input, the whole circuit needs to be covered by meshed grid nodes in TCAD simulations. This increased grid count adversely impacts both convergence and CPU time. Thus, it is infeasible to apply existing initial-guess strategies to circuit-level TCAD simulations directly.
To address the above-mentioned problems, we propose to replace the conventional device-level initial-guess TCAD simulation strategies, e.g., reuse from a previous device-simulation run [7], [9], with the store-load-combine flow shown in Fig. 1. First, we build a device state library that stores device-level solutions. Then, for the given circuit, we estimate the working state of each device in it. Next, we either load the corresponding presolved state from the library or perform a single-device simulation to capture the state. This leads to an appropriate initial guess for the whole circuit by combining all device-level solutions, reducing CPU times substantially.
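The store-load-combine flow can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `DeviceStateLibrary`, the function `build_initial_guess`, the stub solver, the millivolt rounding of keys, and the supply value `VDD` are all hypothetical, and a real device solution would hold the unknowns at every mesh node rather than three numbers.

```python
from typing import Callable, Dict, List, Tuple

# A working state is keyed by (device type, V_gate, V_drain, V_source).
StateKey = Tuple[str, float, float, float]
Solution = List[float]  # stand-in for the per-mesh-node unknowns of one device


class DeviceStateLibrary:
    """Hypothetical store of device-level solutions keyed by working state."""

    def __init__(self, solve_device: Callable[[StateKey], Solution]) -> None:
        self._solve_device = solve_device
        self._store: Dict[StateKey, Solution] = {}
        self.solver_calls = 0

    def load(self, key: StateKey) -> Solution:
        # Assumption: voltages are rounded to 1 mV so that nearly identical
        # working states share one library entry.
        key = (key[0],) + tuple(round(v, 3) for v in key[1:])
        if key not in self._store:
            # Store: run a single-device simulation once and keep the result.
            self._store[key] = self._solve_device(key)
            self.solver_calls += 1
        # Load: return the presolved state.
        return self._store[key]


def build_initial_guess(library: DeviceStateLibrary,
                        device_states: List[StateKey]) -> Solution:
    # Combine: concatenate the device-level solutions into one global vector.
    guess: Solution = []
    for key in device_states:
        guess.extend(library.load(key))
    return guess


def stub_device_solver(key: StateKey) -> Solution:
    # Placeholder for a single-device TCAD run; it just echoes the voltages.
    _, vg, vd, vs = key
    return [vg, vd, vs]


VDD = 1.0  # hypothetical supply voltage
library = DeviceStateLibrary(stub_device_solver)
# Inverter with input 0: the output sits at VDD.
inverter_input_0 = [("p", 0.0, VDD, VDD), ("n", 0.0, VDD, 0.0)]
guess = build_initial_guess(library, inverter_input_0)
guess_again = build_initial_guess(library, inverter_input_0)
```

The second call to `build_initial_guess` triggers no further device simulations, which is the point of keeping the library between circuit-level runs.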
III. RELATED WORK
As mentioned earlier, circuit-level simulations, especially transport equation-based TCAD simulations, suffer from very high CPU times and poor convergence behavior. Hence, much research has been devoted to addressing both these problems.
The various approaches for reducing CPU times of TCAD simulations include improvements in physics models, more efficient equation solvers, and sacrificing accuracy through use of simpler physics models. Hellings et al. [10] proposed a quasi-3-D mixed-mode simulation method to significantly improve simulation efficiency with only a slight loss of accuracy. Chaudhuri et al. proposed a speedup mechanism for FinFET device simulation under process-voltage variations based on physical configuration mapping [6]. Baccar et al. solved the convergence problem in mixed-mode simulations of the deep trench termination diode [11]. Cui et al. studied the convergence behavior of TCAD mixed-mode simulations of electrostatic discharge protection devices [12]. However, efficient mixed-mode simulation with good convergence behavior for more general circuits is still not well-researched.
IV. BACKGROUND
In this section, we present background material on TCAD simulation methodology and various Newton iteration algorithms.
A. TCAD Simulation
TCAD simulations are based on physics models and transport equations, as shown in Fig. 2 [13]. Three types of transport models (drift-diffusion, hydrodynamic, and quantum-hydrodynamic) are included in the scope of TCAD simulations. The drift-diffusion model is one of the most fundamental and popular physics models, and is still widely used in the industry due to its high efficiency and acceptable accuracy [1]. The hydrodynamic model accounts for various complex effects, such as velocity overshoot and effective mass spatial variation, due to its reliance on more complex physics models. Thus, it provides a good balance between simulation accuracy and efficiency. Furthermore, the quantum-hydrodynamic model and the Boltzmann transport equation-based model are usually more precise for deep nanoscale device modeling [14]-[16], since they are able to account for quantum effects like the tunneling current. More complex models based on quantum Monte Carlo simulation, Green's function, and Schrödinger equation are not included in the TCAD scope because of their extremely high CPU times. We adopt the hydrodynamic transport model to demonstrate our methodology. However, it is equally applicable to other transport models.
There are two types of TCAD simulations: device and mixed-mode. Device simulations use carrier transport equations to characterize single device behavior. Mixed-mode simulations, with contact and circuit equations added, are used to characterize standard cells and small circuits and are much more complex than single-device simulations. Often, mixed-mode simulations employ less rigorous transport models to alleviate very high CPU times.
TCAD simulations are grounded in physics equations, which can be transformed to a set of PDEs. After linearization and discretization, the set of PDEs can be transformed to a linearized equation system. The key issue in improving TCAD simulation efficiency is to solve the given linearized equation system expeditiously.
B. Newton Algorithm
As mentioned earlier, iterative solvers are more suitable for TCAD simulations. Previous work [17]-[19] suggests three algorithms to solve the PDEs involved: 1) two-level Newton algorithm; 2) full Newton algorithm; 3) slow embedding Newton algorithm.
Fig. 3 shows the flowchart of the two-level Newton algorithm. First, circuit-level equations, e.g., based on Kirchhoff's rules, are obtained for the various circuit elements. Then, the transport equations are solved separately for each device in the circuit, with the circuit setting the boundary conditions, until convergence is achieved in all devices. Next, the conductance and current of all devices are calculated and assembled in the circuit-level equations. Then, the circuit-level equations are solved and a convergence check is performed to decide whether another iteration is needed.
In the full Newton algorithm, the device and circuit-level equations are modeled together as a set of PDEs. The Newton iteration method is applied to the whole system of equations, and the complete set of unknowns is solved simultaneously.
In the slow embedding Newton algorithm proposed by Grasser et al., an iteration-dependent conductance is introduced between each device and ground to stabilize the coupled system as well as to moderately decouple the device from the circuit equations. A good choice of such a conductance can improve the convergence behavior significantly [19].
Table I shows a comparison of the two-level, full, and slow embedding Newton algorithms. In the slow embedding Newton algorithm, the introduced conductance is chosen in a purely empirical fashion. Thus, there is no guarantee of good convergence behavior. For this reason, we mainly focus on the other two algorithms. The two-level Newton algorithm usually incurs a higher CPU time expense, which is what we would expect since it requires more iterations to converge (all inner loops need to achieve convergence for each step of the outer loop), whereas the full Newton algorithm only requires the convergence of the whole system-level loop. The three solvers deliver exactly the same result and accuracy since they share identical physics equations. In addition, in contrast to the full Newton algorithm, which usually requires a good initial guess to guarantee convergence, the two-level Newton algorithm is more robust in terms of being able to converge even when the initial guess is not that good. Hence, the selection between the two-level and full Newton algorithms is a tradeoff between convergence and efficiency.
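The sensitivity of the full Newton algorithm to its initial guess can be seen on a toy problem. The sketch below is only an analogy, not a TCAD solver: a single "device" equation (an ideal diode, assumed here for illustration) and a single "circuit" equation (a series resistor and source) are solved simultaneously for the unknowns $(V, I)$. All parameter values and the function name `full_newton` are hypothetical.

```python
import math

I_S = 1e-14      # hypothetical diode saturation current (A)
V_T = 0.02585    # thermal voltage near room temperature (V)
R = 1000.0       # hypothetical series resistance (ohm)
V_SRC = 1.0      # hypothetical source voltage (V)


def full_newton(v, i, tol=1e-10, max_iter=100):
    """Solve device and circuit equations together; return (v, i, iterations)."""
    for iteration in range(1, max_iter + 1):
        e = math.exp(v / V_T)
        # Residuals: device equation f1 and circuit (Kirchhoff) equation f2.
        f1 = i - I_S * (e - 1.0)
        f2 = V_SRC - R * i - v
        # Jacobian of (f1, f2) with respect to (v, i).
        a, b = -I_S * e / V_T, 1.0
        c, d = -1.0, -R
        det = a * d - b * c
        # Newton update from J * (dv, di) = -(f1, f2), via Cramer's rule.
        dv = (-f1 * d + b * f2) / det
        di = (-a * f2 + c * f1) / det
        v, i = v + dv, i + di
        if abs(dv) < tol and abs(di) < tol:
            return v, i, iteration
    raise RuntimeError("Newton iteration did not converge")


# Poor guess: everything at zero. Good guess: close to the operating point.
v_poor, i_poor, n_poor = full_newton(0.0, 0.0)
v_good, i_good, n_good = full_newton(0.6, 0.4e-3)
print(n_poor, n_good)
```

Both runs reach the same operating point, but the run started near the solution needs fewer Newton updates. In this analogy, the device state library plays the role of supplying the good starting point.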
In this paper, we provide a new simulation flow that bypasses the initial guess generation problem in the full Newton algorithm and thus combines the benefits of both algorithms: more efficient than the full Newton algorithm, while offering as much robustness as the two-level Newton algorithm.
V. METHODOLOGY
In this section, we first discuss the device state library in detail. Then, we describe the two speedup techniques of the proposed methodology that rely on it.
A. Device State Library
The main idea is that by simulating a single device in various working states, we can obtain a set of device-level solutions to establish a device state library, which can be used to build suitable initial guesses for circuit-level mixed-mode simulations, thus saving significant CPU time.
The device state library consists of two components: 1) six basic states for all transistor types, added to the library initially; 2) other working states that occur during mixed-mode simulation, which are added as needed. The six basic working states are stored for both nFinFETs and pFinFETs. Table II shows these states for a typical transistor with three electrodes (gate, drain, and source). They cover all possible logic-value combinations, assuming the drain and source electrodes are symmetric (two gate values times three unordered drain/source value pairs). Since these logic-value combinations are common device working states in digital circuits, they are the initial members of the library. Later on, other working states are included in the library as they are encountered during simulation. For mixed-mode simulation, we simply fetch the corresponding presolved solution from the library to form a component of the initial guess.
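The count of six basic states follows from the drain/source symmetry and can be checked by enumeration. The short sketch below uses logic values (0 and 1) rather than voltages. The function name `basic_working_states` and the choice to list the higher drain/source value first are illustrative conventions, not taken from Table II.

```python
from itertools import product


def basic_working_states():
    """Enumerate (gate, terminal_hi, terminal_lo) logic-value combinations.

    Drain and source are treated as interchangeable, so each unordered
    drain/source pair is stored once (higher value first, by convention).
    """
    states = set()
    for gate, drain, source in product((0, 1), repeat=3):
        hi, lo = max(drain, source), min(drain, source)
        states.add((gate, hi, lo))
    return sorted(states)


BASIC_STATES = basic_working_states()
# Initial library keys: the same six states for both transistor types.
INITIAL_KEYS = [(kind,) + state
                for kind in ("nFinFET", "pFinFET")
                for state in BASIC_STATES]
print(len(BASIC_STATES), len(INITIAL_KEYS))
```

Eight raw combinations collapse to six once the two mixed drain/source cases for each gate value are merged, giving twelve initial library entries across the two transistor types.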
Next, we discuss how TCAD mixed-mode simulation can be significantly sped up with the help of the device state library.
B. Speedup Mechanism 1: Smallest Subnetwork Simulation
The working state of each device in the circuit needs to be obtained first before the solution for this state can be fetched from the device state library. This initial-guess establishment procedure has three steps: 1) estimating the voltages applied to each device; 2) loading the corresponding solution from the device state library; 3) combining device states to form an initial guess for the whole circuit. The most critical step is voltage evaluation (Step 1), since the device state library makes solution loading (Step 2) and combining (Step 3) straightforward. To correctly and efficiently estimate the voltages applied to each device, we employ a smallest subnetwork simulation technique. We explain this technique with the help of a two-input NAND gate (NAND2), shown in Fig. 4, whose inputs A and B are set to 0.
For a NAND2 with inputs 00, output Z is logic 1, corresponding to a voltage of $V_{dd}$. Hence, the working state of both the pFinFETs (pFinFET1 and pFinFET2) can be captured in a straightforward manner as

$$V_g = 0, \quad V_s = V_{dd}, \quad V_d = V_{dd}$$

[assuming the top (bottom) electrode is the source and the bottom (top) electrode is the drain for pFinFETs (nFinFETs), and $V_g$, $V_d$, and $V_s$ denote the gate, drain, and source voltages, respectively]. However, for both the nFinFETs (nFinFET1 and nFinFET2), the situation is more complicated. The voltage at node p, which is the source electrode of nFinFET1 and drain electrode of nFinFET2, is neither 0 nor $V_{dd}$. Hence, smallest subnetwork simulation needs to be employed to extract voltage $V_p$ at this node. In this case, the subnetwork is the pull-down network of NAND2, as shown in Fig. 5(a), since its simulation would yield a value for $V_p$. In general, the smallest subnetwork of a cell or circuit is its smallest part from which we can obtain the electrical parameter value of interest. We obtained $V_p = 0.022$ V after simulating the pull-down network. Hence, the working states of nFinFET1 and nFinFET2, respectively, are

$$\text{nFinFET1:} \quad V_g = 0, \quad V_d = V_{dd}, \quad V_s = V_p = 0.022\ \text{V}$$

$$\text{nFinFET2:} \quad V_g = 0, \quad V_d = V_p = 0.022\ \text{V}, \quad V_s = 0$$
Now, we have obtained the working states of all devices. These states can be used to form an initial guess for NAND2 in the loading and combining steps.
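The voltage-estimation step for the NAND2 pull-down network can be mimicked with a toy model. The sketch below replaces the TCAD subnetwork simulation with an idealized subthreshold current expression, which is an assumption made purely for illustration, and finds the internal node voltage at which the currents of the two stacked off-state transistors balance. The constants `VDD`, `I0`, and `N_IDEALITY` and all function names are hypothetical, so the resulting value should not be compared with the 0.022 V obtained from TCAD.

```python
import math

V_T = 0.02585        # thermal voltage near room temperature (V)
VDD = 0.9            # hypothetical supply voltage (V)
I0 = 1e-9            # hypothetical off-state current prefactor (A)
N_IDEALITY = 1.0     # hypothetical subthreshold ideality factor


def subthreshold_current(vg, vd, vs):
    """Idealized subthreshold nFET current (toy stand-in for a device run)."""
    vgs, vds = vg - vs, vd - vs
    return I0 * math.exp(vgs / (N_IDEALITY * V_T)) * (1.0 - math.exp(-vds / V_T))


def solve_internal_node(vdd, tol=1e-12):
    """Bisection for the stack node voltage where both currents are equal."""
    lo, hi = 0.0, vdd
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        top = subthreshold_current(0.0, vdd, mid)      # nFinFET1, gate at 0
        bottom = subthreshold_current(0.0, mid, 0.0)   # nFinFET2, gate at 0
        if top > bottom:
            lo = mid   # node is still being charged, so it must sit higher
        else:
            hi = mid
    return 0.5 * (lo + hi)


v_p = solve_internal_node(VDD)

# Working states (V_g, V_d, V_s) of all four NAND2 devices for inputs 00.
working_states = {
    "pFinFET1": (0.0, VDD, VDD),
    "pFinFET2": (0.0, VDD, VDD),
    "nFinFET1": (0.0, VDD, v_p),
    "nFinFET2": (0.0, v_p, 0.0),
}
print(round(v_p, 4))
```

With an ideality factor of 1, this toy model settles at $V_T \ln 2$, a few tens of millivolts above ground, which is the same qualitative behavior as the simulated node: neither 0 nor $V_{dd}$, but close to ground.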
C. Speedup Mechanism 2: Equivalent States
Next, we present another speedup mechanism based on equivalent states. Consider the following two situations: 1) NAND2 with inputs 00; 2) three-input NAND gate (NAND3) with inputs 001. Both gates have logic 1 outputs. Fig. 5 shows the pull-down networks for the two cases. Suppose that we have already acquired the working state for NAND2 (see Section V-B). In the NAND3 pull-down network, since nFinFET3 gate input is logic 1, $V_3$ is directly connected to ground. Therefore, the smallest subnetwork in this case consists of nFinFET1 and nFinFET2, as shown by the dashed box. However, this is identical to the NAND2 pull-down network shown in Fig. 5(a), and thus shares its working state. Hence, $V_2 = V_1 = 0.022$ V. This leads to the working state of NAND3 under the 001 input.
In general, if the working state of a circuit can be deduced from a presolved or known working state of another circuit, then the former is said to be derivable from the latter. If the working states of the two circuits are derivable from each other, we define them as equivalent states.
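One way to picture the equivalent state check is as a cache keyed by a canonical form of the subnetwork. The sketch below handles only the series pull-down stack from the NAND example and assumes, for illustration, that conducting transistors on the ground side of the stack can be collapsed into ground. The names `canonical_stack` and `SubnetworkCache` are hypothetical, and the stub simulator simply returns the 0.022 V value reported in the text.

```python
from typing import Callable, Dict


def canonical_stack(gate_bits: str) -> str:
    """Canonical key for a series nFinFET stack.

    gate_bits lists the gate logic values from the output side to the
    ground side. Conducting transistors ('1') next to ground are dropped,
    since they tie their upper node to ground.
    """
    return gate_bits.rstrip("1")


class SubnetworkCache:
    """Runs a smallest subnetwork simulation only for unseen canonical keys."""

    def __init__(self, simulate: Callable[[str], float]) -> None:
        self._simulate = simulate
        self._results: Dict[str, float] = {}
        self.simulations = 0

    def internal_voltage(self, gate_bits: str) -> float:
        key = canonical_stack(gate_bits)
        if key not in self._results:          # equivalent state check failed
            self._results[key] = self._simulate(key)
            self.simulations += 1
        return self._results[key]


def stub_subnetwork_simulation(key: str) -> float:
    # Placeholder for a TCAD run; 0.022 V is the value quoted for the
    # two-transistor off-state stack.
    return 0.022 if key == "00" else 0.0


cache = SubnetworkCache(stub_subnetwork_simulation)
v_nand2 = cache.internal_voltage("00")    # NAND2 with inputs 00
v_nand3 = cache.internal_voltage("001")   # NAND3 with inputs 001
print(v_nand2, v_nand3, cache.simulations)
```

The NAND3 query reuses the NAND2 result, so only one subnetwork simulation is performed for the two cells.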
We always perform an equivalent state check before embarking on the smallest subnetwork simulation. Whenever it is successful and the working state is found in the device state library, the smallest subnetwork simulation can be skipped.
VI. SIMULATION RESULTS
In this section, we discuss how the proposed mixed-mode simulation method can be used to capture the quasi-stationary operating points of logic gates in the 22-nm FinFET standard cell library shown in Table III and a mirror adder. We also apply our method to extract the SNM of SRAMs. The 22-nm FinFET parameter values are obtained from a combination of device fabrication and device simulation [20], [21]. We implement our methodology on Sentaurus TCAD from Synopsys [7]. We use the hydrodynamic transport model. However, as mentioned earlier, the method is not limited to this transport model. We adopt the band-to-band and Shockley-Read-Hall models for the recombination terms in the transport model and velocity overshoot models for mobility correction. We use more than $10^5$ mesh nodes for each FinFET to ensure accuracy. We run all the simulations on a PC with Intel Ivy Bridge processors operating at 2.5 GHz, with 8-GB memory. We demonstrate the improved efficiency of our methodology by comparing our simulation results with those obtained from traditional TCAD mixed-mode simulations.
A. Simulation of Standard Cells
We use our methodology to extract the quasi-stationary operating points of standard cells. Table IV shows the working state, leakage current $I_{leak}$, and cell output, where $V_p$ and $V_q$ are defined in Figs. 4 and 6.
We obtain a significant reduction in simulation time, as shown in Table V, where $T_{old}$ denotes the original mixed-mode simulation time, $T_{extra}$ is the time for obtaining the initial guess, $T_{new}$ is the mixed-mode simulation time for the proposed methodology that takes advantage of the initial guess, and $r$ denotes the speedup, as defined in (5)

$$r = \frac{T_{old}}{T_{extra} + T_{new}} \tag{5}$$
$T_{extra}$ consists of the following three parts: 1) device state library initialization with six basic working states; 2) smallest subnetwork simulation for working state estimation; 3) addition of new device states to the device state library. Note that library initialization is a one-time cost, since we just need to initialize the device state library in the beginning. The average (highest) speedup is 6.9× (148.2×), as shown in Table V. We observe that $T_{extra} = 0$ in many cases. This is due to either direct extraction of the working state (e.g., INV with input 0) or equivalent states (e.g., NAND3 with inputs 001). Once the library is built, total simulation time can be reduced significantly because the initial guess obtained from the device state library is very close to the final solution. Note that the traditional and proposed TCAD mixed-mode simulations yield exactly the same result, since they use identical physics models and equations. The only difference between them is the initial guess for the full Newton solver, which has no impact on the final solution.
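As a small worked example of the timing bookkeeping, the sketch below assumes that the speedup is the ratio of the original simulation time to the total time of the proposed flow (initial-guess time plus the accelerated mixed-mode run). It also converts a relative time reduction into the equivalent speedup. The function names and the sample timings are hypothetical and are not taken from Table V.

```python
def speedup(t_old, t_extra, t_new):
    """Speedup r, assuming r = T_old / (T_extra + T_new)."""
    return t_old / (t_extra + t_new)


def reduction_to_speedup(reduction):
    """Convert a fractional time reduction (e.g., 0.47) into a speedup."""
    return 1.0 / (1.0 - reduction)


# Hypothetical timings in seconds: an equivalent state was found (T_extra = 0)
# versus a case that needed an extra subnetwork simulation first.
r_reused = speedup(t_old=1200.0, t_extra=0.0, t_new=150.0)
r_with_extra = speedup(t_old=1200.0, t_extra=90.0, t_new=150.0)

# A 47.0% reduction in CPU time corresponds to a speedup of about 1.89x.
r_from_reduction = reduction_to_speedup(0.47)
print(r_reused, r_with_extra, round(r_from_reduction, 2))
```

The comparison makes the role of $T_{extra}$ explicit: whenever an equivalent state is found, the whole benefit of the better initial guess shows up in the speedup.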
In addition, our proposed simulation flow has no memory overhead compared with the traditional TCAD flow, as shown in Fig. 7. This is due to the approximate linearity between the number of meshed grids and the total memory requirement. Generally, a fixed size of memory is allocated to a single meshed grid to store its electrical parameters. Since the proposed method does not change the meshing strategy, memory usage remains roughly the same.
B. Simulation of SRAMs
SRAMs are widely used in integrated circuits today. Fig. 8 shows the structure of a typical 6T SRAM cell. With increasing process variations and lowering of supply voltages in nanometer-scale technologies, SRAMs have become increasingly vulnerable to noise. Hence, finding the SNM is an important consideration in SRAM design. The SNM of an SRAM cell can be extracted using the following three steps.
1) Set the supply voltage to $V_{dd}$.
2) Set sweep starting points (voltages at WL, BL, and $\overline{\mathrm{BL}}$) for hold, read, and write operations, as shown in Table VI.
3) Extract the voltage transfer characteristic (VTC) curves between P and Q.
The first two steps are essential requirements for setting up the correct initial quasi-stationary working configuration for the SRAM cell, though they do not contribute to the VTC curves. We avoid these steps by obtaining a sweep starting point directly from the device-level solutions stored in the device state library.
In our methodology, we need to obtain approximate voltages at all electrodes of all devices of the circuit. In the SRAM cell, this means acquiring the values of $V_q$ and $V_p$ at the VTC sweep starting point. By simply analyzing the SRAM structure, we can obtain the $V_p$ and $V_q$ values as shown in Table VI. Then, corresponding device-level solutions are loaded from the library and combined to form a global initial guess.
SRAM array simulations can also be sped up by reusing simulation results of individual SRAM cells. Since simulation of an SRAM cell in the array has limited impact on other cells in the array, the SRAM cell itself typically forms the smallest subnetwork for working state estimation. This helps establish the initial guess and enables us to capture the quasi-stationary operating points and SNMs of an SRAM array much more efficiently.
We simulated a single SRAM cell, 1 × 3 SRAM array, and 3 × 3 SRAM array. The extracted hold, read, and write SNMs of an SRAM cell are shown in Figs. 9(a), 10(a), and 11(a), respectively.
The corresponding simulation times for the original and proposed flows are shown in Figs. 9(b), 10(b), and 11(b), respectively. Relative to the original flow, we obtain a simulation time reduction of 47.0%, 64.7%, and 61.1% in the total SNM extraction time (the sum of hold, read, and write SNM extraction times) for the SRAM cell, 1 × 3 SRAM array, and 3 × 3 SRAM array, respectively.
Since the Newton iterative solver displays quadratic convergence [5] , the initial guess plays a very important role in TCAD mixed-mode simulations. Thus, the key to reduced simulation times is finding an initial guess close to the final solution, which is what the device state library enables.
C. Simulation of a Mirror Adder
Adders are one of the most fundamental building blocks of arithmetic logic units. Although there are several adder implementations, mirror adders are among the most widely used. Fig. 12 shows a typical mirror adder structure with 24 transistors.
The outputs of the full adder, $C_o$ and $S$, under various combinations of inputs ($A$, $B$, and $C_i$) can be obtained by

$$C_o = AB + BC_i + AC_i$$

$$S = A \oplus B \oplus C_i$$
For all the smallest subnetworks involved in mirror adder simulation, we can find an equivalent state from the simulation of standard cells. Thus, the working states of all transistors can be extracted and loaded from the library effortlessly, and the extra time cost $T_{extra}$ is 0. Table VII shows the simulation result as well as time cost. The average (highest) speedup for quasi-stationary operating point establishment of a 24T mirror adder is 21.2× (27.4×).
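The logic values that the mirror adder outputs must settle to for each input combination can be tabulated with the standard full-adder relations. The sketch below is a plain Boolean check with hypothetical function names. It also verifies that the form commonly used to build mirror adders, in which the sum is derived from the inverted carry, agrees with the XOR form for all eight input combinations.

```python
from itertools import product


def full_adder(a, b, ci):
    """Standard full-adder outputs: majority carry and XOR sum."""
    co = (a & b) | (b & ci) | (a & ci)
    s = a ^ b ^ ci
    return co, s


def mirror_form_sum(a, b, ci):
    """Sum expressed through the inverted carry: S = A.B.Ci + ~Co.(A + B + Ci)."""
    co, _ = full_adder(a, b, ci)
    return (a & b & ci) | ((1 - co) & (a | b | ci))


truth_table = {}
for a, b, ci in product((0, 1), repeat=3):
    co, s = full_adder(a, b, ci)
    truth_table[(a, b, ci)] = (co, s)

forms_agree = all(mirror_form_sum(*inputs) == outputs[1]
                  for inputs, outputs in truth_table.items())
print(len(truth_table), forms_agree)
```

Such a table gives the expected logic values of the outputs for every input vector, which is the starting information needed when estimating device working states.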
VII. CONCLUSION
In this paper, we described a new flow to significantly reduce the high CPU times of TCAD mixed-mode simulations. As opposed to conventional device-level initial-guess techniques, our flow is based on first obtaining a device state library that stores the working state of the devices in the circuit. These device-level presolved solutions are combined into a global initial guess for the whole circuit. We used our flow to simulate various standard cells, SRAM cell/arrays, and a mirror adder, and demonstrated significant speedups. In future work, this approach could be extended to transient or larger-scale circuit-level simulations.
