Models of simulation program with integrated circuit emphasis (SPICE) 
Introduction
Simulation program with integrated circuit emphasis (SPICE) models, such as the BSIM, HiSIM, and PSP models, characterize the electrical characteristics of very large scale integration (VLSI) devices (e.g., current-voltage (I-V) curves) through a set of optimized parameters [1]-[5]. SPICE model parameter extraction usually involves several hundred I-V points and forms a multidimensional nonlinear optimization problem; model parameter extraction for a VLSI device is therefore a time-consuming task that requires engineering expertise to find a set of proper parameters with reasonable physical meanings [1]-[6]. Many approaches to model parameter extraction, such as pure genetic algorithm (GA) or numerical optimization methods, have been reported [1], [6]-[11], and most of them were applied separately. Unfortunately, such methods may not work efficiently for VLSI devices in the deep-submicron regime. To overcome this problem, we recently developed a hybrid intelligent model parameter extraction technique based on the genetic algorithm, the monotone iterative Levenberg-Marquardt method, and the neural network algorithm [1]. A prototype was successfully implemented according to the proposed methodology. In several test cases, global extraction shows good accuracy for 90 nm n-type metal-oxide-semiconductor field effect transistors (NMOSFETs). However, to accelerate the extraction process of the developed prototype for larger-scale optimization problems, the system must be parallelized.
In this paper, we implement a parallel optimization platform for VLSI device model parameter extraction on a Linux-based PC cluster with message passing interface (MPI) libraries. The GA of the previously developed system is parallelized on 16 PCs with a diffusion scheme that forms a 2D-grid network. When the GA stage is performed on a processor, chromosomes are simultaneously exchanged with the results computed by its 4 neighboring processors. The optimization process then proceeds to the next step according to the system configuration of the hybrid intelligent model parameter extraction technique [1]. Extraction terminates when the specified stopping criterion is met. Our extraction experience shows that this approach gives markedly better results when the dimension of the problem is large, such as parameter extraction for more than 8 VLSI devices. Compared with an isolated parallel GA, a difference of more than 33% in evolution time is found between the two parallelization algorithms when 16 devices are optimized. Results for different examples with multiple VLSI devices are examined in terms of several benchmarks, such as speedup, efficiency, and accuracy, to show the robustness and efficiency of the method. Theoretical estimation and preliminary implementation show that there is an optimal number of processors with respect to the number of devices to be extracted. For example, according to our theoretical calculation, the optimal number of units is 18, which is close to the practically obtained result (16 units) shown in Tables 3, 4, and 5.
This article is organized as follows. In Sec. 2, we briefly describe our extraction system and state the architecture of the parallel computing algorithms. In Sec. 3, we show the extraction results for single and multiple deep-submicron and sub-100 nm N-MOSFET devices. Finally, we draw conclusions.
Parallelization of the Model Parameter Extraction System
In this section, the proposed architecture of the parallel optimization platform is described first, followed by a theoretical estimation of the optimal parallel performance of the diffusion GA.
The Parallel Architecture
Mathematically, model parameter extraction is a multidimensional nonlinear optimization problem, where the number of parameters is larger than 100. The main goal of device model parameter extraction is to minimize the error between the extracted result and the measurement, where the extracted result is obtained through the equation below:
where $I_{DS}^{ex}$ is the I-V function (e.g., the I-V points shown in Fig. 4) to be optimized; $I_D$ is a selected compact model [1]-[4], which contains more than 40 
are the parameter sets to be extracted, the bias condition for simulation, and the device geometry, respectively. At least 50 I-V points form an I-V curve, 5 I-V curves form a set of I-V curves, and 4 sets of I-V curves characterize the behavior of a single device. Therefore, a device model parameter extraction problem can be formulated as follows:
where $I_{DS}^{me}$ is the measured I-V point, and d, cs, c, and p refer to the number of devices, curve sets, curves, and I-V points, respectively. When model parameter extraction is performed with 16 devices as the target, the errors at 16000 I-V points (16 devices × 4 curve sets × 5 curves × 50 points) must be minimized, and the number of parameters is more than 100. The nonlinear optimization problem is subject to proper physical constraints.
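To make the size of this problem concrete, the following Python sketch evaluates an error measure over the nested device / curve-set / curve / point structure described above. The RMS relative error used here is an assumption chosen for illustration (the exact form of the objective is not reproduced in this text), and `rms_relative_error` and `make_data` are hypothetical helper names with synthetic data.

```python
import math

def rms_relative_error(measured, extracted):
    """RMS relative error over data indexed as [device][curve_set][curve][point].

    Assumption: a relative-error RMS is used as the error measure; the
    original objective function may be defined differently.
    """
    total, count = 0.0, 0
    for dev_me, dev_ex in zip(measured, extracted):
        for set_me, set_ex in zip(dev_me, dev_ex):
            for curve_me, curve_ex in zip(set_me, set_ex):
                for i_me, i_ex in zip(curve_me, curve_ex):
                    total += ((i_ex - i_me) / i_me) ** 2
                    count += 1
    return math.sqrt(total / count), count

def make_data(scale, d=16, cs=4, c=5, p=50):
    """Synthetic currents (hypothetical values) with the sizes quoted in the text."""
    return [[[[scale * 1e-4 * (k + 1) for k in range(p)]
              for _ in range(c)]
             for _ in range(cs)]
            for _ in range(d)]

measured = make_data(1.0)
extracted = make_data(1.02)   # every simulated point is 2% too high
error, n_points = rms_relative_error(measured, extracted)
print(n_points, error)
```

With 16 devices, 4 curve sets, 5 curves, and 50 points per curve, the loop visits 16000 points per fitness evaluation, which is why a single evaluation is costly.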
This large-scale, computation-intensive optimization problem is solved with our previously proposed extraction system. The developed hybrid optimization platform integrates the genetic algorithm, the monotone iterative Levenberg-Marquardt method, and the neural network algorithm, as shown in Fig. 1. When the GA obtains a solution, the monotone iterative Levenberg-Marquardt method is activated to search for nearby local optima, and the neural network algorithm suggests proper search directions according to the current result and the physical constraints. A detailed description of this extraction system is reported elsewhere [1].
Although the extraction system has been proposed and implemented successfully, a larger-scale complex optimization problem with massive computation still requires an enormous amount of time, so a parallelization technique is needed. The time consumed by the monotone iterative Levenberg-Marquardt method and the neural network algorithm is negligible compared with the time cost of the GA; therefore, only the GA needs to be parallelized. Parallelizing the GA provides an efficient way to reduce the computing time [12]-[15]. The GA is a self-adaptive optimization strategy that mimics a living system; it usually consists of five operations: encoding, fitness evaluation, selection, crossover, and mutation. We briefly describe the GA methods used for MOSFET device model parameter extraction. The design of the gene encoding strategy depends on the properties of the problem. In this problem, there are more than 100 parameters, and all variables are floating-point numbers. The fitness function measures the error between the simulated result and the measured data. For reproduction, we adopt tournament selection with floating-point operators as the selection strategy; this hybrid strategy not only selects better chromosomes but also keeps weak ones for a few generations to achieve higher population diversity. For the crossover scheme, all parameters to be optimized in the MOSFET device model can be classified into several categories, each of which stands for different physical characteristics [1]. With this in mind, we use a uniform crossover scheme to preserve the physical characteristics of the parents; based on our simulation experience, it is more effective than single-point and two-point crossover schemes. Finally, the mutation strategy changes the mutation rate dynamically to maintain population diversity.
Such evolutionary optimization may take a long time when the dimension of the investigated problem is large, in particular for nanoscale VLSI device model parameter extraction [1]. To reduce the time cost of optimization, parallel schemes are considered.
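The operators named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the extraction system: the tournament size `k`, the resampling mutation, the two-level rule in `dynamic_rate`, and all function names are hypothetical choices, and the fitness is a placeholder in which a lower value means a smaller error.

```python
import random

def tournament_select(population, fitness, k=3, rng=random):
    """Return the chromosome with the lowest error among k random contestants."""
    contestants = rng.sample(range(len(population)), k)
    best = min(contestants, key=lambda i: fitness[i])
    return population[best]

def uniform_crossover(parent_a, parent_b, rng=random):
    """Take each gene from either parent with equal probability."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]

def mutate(chromosome, rate, bounds, rng=random):
    """Resample a gene uniformly inside its bounds with probability `rate`."""
    child = []
    for gene, (lo, hi) in zip(chromosome, bounds):
        if rng.random() < rate:
            gene = rng.uniform(lo, hi)
        child.append(gene)
    return child

def dynamic_rate(diversity, low=0.01, high=0.2, threshold=0.05):
    """Assumed rule: raise the mutation rate when population diversity is low."""
    return high if diversity < threshold else low

rng = random.Random(0)
bounds = [(0.0, 1.0)] * 8                      # 8 floating-point genes
population = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(20)]
fitness = [sum(g * g for g in chrom) for chrom in population]  # placeholder error

mother = tournament_select(population, fitness, rng=rng)
father = tournament_select(population, fitness, rng=rng)
crossed = uniform_crossover(mother, father, rng=rng)
child = mutate(crossed, dynamic_rate(diversity=0.01), bounds, rng=rng)
```

Because a small tournament only compares a few randomly drawn chromosomes, weaker individuals still have a chance to be selected, which is one common way to keep diversity in the population.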
It is known that the parallelization of GA can be classified into five different models: the isolated, the ring migration, the neighborhood migration, the unrestricted migration, and the diffusion GA [15]. Each unit in the isolated configuration performs the extraction tasks separately, and there is no data communication among units. The obvious advantage of the isolated architecture is that less communication time is spent in the extraction procedure; however, the isolated evolutionary environment may lead to a striking decrease in population diversity. In contrast to the isolated GA, each extraction unit of the migration GA is treated as a separate breeding unit, and migrations between units occur from time to time to pro- Each individual is assigned to a specific location, and migration is permitted between a set of specific neighbors.
In advanced VLSI device model parameter extraction, parameters can be classified into several groups according to their engineering meanings, and each group represents a specific physical phenomenon [1]. By applying the diffusion GA, we can assign each column of the 2D grid of units to optimize a different group of parameters. This configuration also matches our optimization method; we therefore consider the diffusion GA the most suitable distributed configuration. Based on our extraction experience, we focus on the isolated GA and the diffusion GA in a series of comparisons.
The parallel extraction system is implemented on a PC-based Linux cluster with 16 units. Each unit is physically connected to a high-speed network switch and performs automatic parameter extraction. The system architecture consists of two modules: the management server and the extraction cluster. The server controls the whole extraction system. It analyzes the complexity of the problem and, based on the analysis results, sets up the configuration of the system architecture and allocates proper computing resources. During extraction, the server monitors the extraction process, backs up necessary information, controls the extraction flow, and communicates with the other extraction modules. The extraction cluster consists of many extraction units, each of which can act as an independent extraction entity or participate in the distributed parameter extraction process under the control of the extraction management server.
Figure 2 shows the workflow of our distributed parameter extraction engine. Once the procedure starts, the environment is first initialized; each unit (or processor) then begins its job and sends the current result to the server if data transmission is required. This procedure loops until the target fitness score is reached or the evolution time limit expires.
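The 2D-grid layout of the 16 units can be illustrated with a short Python sketch. It assumes a 4 × 4 grid with wrap-around (torus) edges so that every unit has exactly 4 neighbors, and a row-major numbering of units; both are assumptions made for illustration, and `grid_neighbors` and `column_group` are hypothetical helper names. In an MPI program, a comparable layout could be obtained from a periodic Cartesian communicator (`MPI_Cart_create` and `MPI_Cart_shift`).

```python
def grid_neighbors(rank, rows=4, cols=4):
    """Ranks of the north, south, west, and east neighbors on a torus grid."""
    r, c = divmod(rank, cols)
    return [
        ((r - 1) % rows) * cols + c,   # north
        ((r + 1) % rows) * cols + c,   # south
        r * cols + (c - 1) % cols,     # west
        r * cols + (c + 1) % cols,     # east
    ]

def column_group(rank, cols=4):
    """Index of the parameter group handled by the column this unit sits in."""
    return rank % cols

for rank in range(16):
    print(rank, column_group(rank), grid_neighbors(rank))
```

Under these assumptions, unit 5 exchanges chromosomes with units 1, 9, 4, and 6, and units 1, 5, 9, and 13 all work on the same parameter group.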
A Theoretical Estimation
Next, a theoretical estimation of the optimal parallel performance of the diffusion GA is given for the implemented parallel extraction system. Assume that there are $p$ processors, the communication time cost is $T_c$, the population size is $n$, and the time to evaluate one individual is $T_f$. In our implemented diffusion GA, the number of neighbors of each unit is set to 4. Thus the entire time cost for one generation, $T_p$, is given by
$$T_p = pT_c + \frac{nT_f}{p} + 4pT_c, \qquad (3)$$
where $4pT_c$ is the extra communication cost of the diffusion GA. As more processors are used, the computation time $nT_f/p$ decreases as desired, but the communication time increases. This tradeoff implies the existence of an optimal number of processors that minimizes the execution time. To find it, we set $\partial T_p/\partial p = 0$ and solve the corresponding equation for $p$:
$$p^* = \sqrt{\frac{nT_f}{5T_c}}. \qquad (4)$$
The time that a sequential GA uses in one generation is $T_s = nT_f$. For the parallel implementation to perform better than a sequential GA, the following relationship must hold:
$$\frac{T_s}{T_p} = \frac{nT_f}{\frac{nT_f}{p} + 5pT_c} > 1. \qquad (5)$$
This ratio is the parallel speedup of the diffusion GA, and it formalizes the intuitive notion that parallelization does not benefit problems with very short evaluation times. Another concern when implementing parallel algorithms is to keep the processor utilization high. Formally, the efficiency of a parallel program is defined as the ratio of the parallel speedup to the number of processors:
$$E_f = \frac{T_s}{pT_p} = \frac{nT_f}{nT_f + 5p^2T_c}. \qquad (6)$$
Ideally, the parallel speedup equals the number of units used, and the efficiency equals 100%. However, the cost of communication causes the efficiency to decrease as more units are used. To find an economical number of units ($p_e$) that maintains a pre-estimated efficiency $\hat{E}_f$, we set Eq. (6) equal to $\hat{E}_f$ and solve the corresponding equation for $p$. The computed $p_e$ is given by
$$p_e = \sqrt{\frac{1-\hat{E}_f}{\hat{E}_f}\cdot\frac{nT_f}{5T_c}}. \qquad (7)$$
We note that $p_e = p^*$ when $\hat{E}_f$ is 0.5; the maximum speedup achievable by the diffusion GA therefore equals half the optimal number of units. In our experiment, the communication time cost $T_c$ is approximately 32 ms, the evaluation time $T_f$ is around 0.068 s for the 16-device simulation, and the population size is set to 800. As a result, from Eq. (4), the optimal number of units is about 18. From this point of view, if more units are included in the parallel extraction system, the speedup will not improve further; moreover, it might decrease due to heavy communication in the network. We implemented these parallelization schemes in our hybrid optimization prototype for VLSI device model parameter extraction. The achieved results confirm the theoretical estimation.
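The estimate can be checked numerically with the short Python sketch below. It assumes a per-generation cost of the form $T_p = pT_c + nT_f/p + 4pT_c$, a reconstruction that is consistent with the statements in this section (the extra $4pT_c$ term, $p_e = p^*$ at 50% efficiency, and an optimum near 18 units) but is not guaranteed to be the exact expression used originally. The function names are hypothetical, and the inputs are the rounded values quoted above.

```python
import math

def generation_time(p, n, t_f, t_c, neighbors=4):
    """Assumed cost of one generation: base communication, evaluation, diffusion."""
    return p * t_c + n * t_f / p + neighbors * p * t_c

def speedup(p, n, t_f, t_c, neighbors=4):
    """Sequential time n*t_f divided by the parallel time."""
    return n * t_f / generation_time(p, n, t_f, t_c, neighbors)

def efficiency(p, n, t_f, t_c, neighbors=4):
    """Speedup per processor."""
    return speedup(p, n, t_f, t_c, neighbors) / p

def optimal_units(n, t_f, t_c, neighbors=4):
    """Processor count that minimizes generation_time (zero of its derivative)."""
    return math.sqrt(n * t_f / ((1 + neighbors) * t_c))

def economical_units(e_target, n, t_f, t_c, neighbors=4):
    """Processor count at which the efficiency equals e_target."""
    return math.sqrt((1 - e_target) / e_target * n * t_f / ((1 + neighbors) * t_c))

n, t_f, t_c = 800, 0.068, 0.032      # population size, seconds, seconds
p_star = optimal_units(n, t_f, t_c)
print(f"optimal units      : {p_star:.2f}")
print(f"speedup at p = 16  : {speedup(16, n, t_f, t_c):.2f}")
print(f"efficiency at p=16 : {efficiency(16, n, t_f, t_c):.2f}")
print(f"max speedup        : {speedup(p_star, n, t_f, t_c):.2f}")
```

Under this assumed model, the quoted values give an optimum of about 18.4 units, and the speedup at the optimum equals half of that value, in line with the discussion above.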
Results and Discussion
In this section, three issues are examined. The first is the performance comparison of the isolated and diffusion GAs, the second is the robustness of our optimization method, and the third is the parallelization configuration of this work. In our extraction experiments, the industry-standard BSIM4 SPICE model is adopted. Figure 3 compares the evolution time of the isolated and the diffusion GA with respect to the number of extracted devices. As shown in this figure, the evolution times are almost the same when the search domain is small. However, when the search domain is enlarged, i.e., when more than 4 devices are extracted, the superiority of the diffusion GA is observed. When the number of target devices is increased to 16, a 33% speedup in the evolution time of the diffusion GA over the isolated one is observed. For problems with a small search domain, such as 1 or 2 devices, the difference between the two parallel methods is insignificant. Based on this experiment, we suggest that the diffusion GA is a suitable distributed method for parallelizing this problem.
We further perform a series of experiments to examine the accuracy and efficiency of the proposed method. ters demonstrates good accuracy of the optimization method. Tables 1 and 2 show the extraction results for 16 N-MOSFETs of 90 nm technology. As shown in these tables, the RMS error is strictly within 3% for all original curves and within 6% for the first derivatives of all original curves. We note that the first derivatives with respect to the curves of
and
Figures 4 and 5 and Tables 1 and 2 confirm the accuracy of the proposed method with respect to As shown in Fig. 6, the experiment verifies the capability of the implemented parallel extraction system for different numbers of working processors and different problem sizes. The accuracy for all extracted VLSI devices is strictly required to be within 3% error for all original curves and 6% error for the first derivatives of all original curves. Tables 3 to 5 show the benchmark results and confirm that the speedup increases as the number of units increases. On the other hand, the efficiency tends to decrease, which is consistent with the known results on the optimal parallelization of GAs [14], [15]. We conclude that, for an acceptable execution time, the most suitable number of processors is 8 for extracting 4 and 8 devices with the BSIM4 SPICE model and 16 for extracting 16 devices. More detailed data are listed in Tables 3, 4, and 5.
Conclusions
In this paper, a parallel genetic algorithm for VLSI device equivalent circuit model parameter extraction has been developed. The GA implemented in the extraction system has been parallelized mainly with a diffusion scheme on a Linux cluster of 16 PCs with MPI libraries.
The results show that the diffusion GA is superior to an isolated GA, and that its superiority becomes significant as the number of devices to be optimized increases. The optimal number of processors with respect to the number of devices to be extracted was estimated, and the preliminary implementation in the developed prototype agrees well with the theoretical estimation. Speedup, efficiency, and extraction accuracy have been reported and discussed for different sets of realistic multiple VLSI devices. The practical implementation of the parallel GA approach benefits the engineering of SPICE model parameter extraction. To validate the developed parallel intelligent model parameter extraction prototype for sub-65 nm VLSI devices, more advanced SPICE models, such as the HiSIM and PSP models, are currently being implemented in this system. In addition, we are performing the extraction on a 32-unit PC-based Linux cluster for much higher computing performance.
