Abstract. We demonstrate that parallel computing on a graphics processing unit (GPU) effectively accelerates a particle-based carrier transport simulation, the EMC/MD method. The simulation is accelerated by parallelizing the point-to-point Coulomb force calculation, and the resulting speed is sufficient to carry out a device characteristic simulation of a nanostructured metal-oxide-semiconductor field-effect transistor (MOSFET) including the source and drain diffusion regions. The EMC/MD simulation powered by GPU computing is a useful tool for the statistical variability analysis of nanoscale transistors.
Introduction
With the progress of scaling and integration of Si devices, the static device-to-device variability of characteristics and the dynamic variability due to current fluctuation have become serious concerns [1] [2] [3] [4]. The static variability is induced by a lack of structural integrity, such as the random distribution of impurity ions in the device. The dynamic variability becomes prominent as the number of transported carriers decreases in small devices. To extract the major factors from these variability sources, a detailed device characteristic simulation that describes the discrete impurity ions and mobile charges is indispensable.
The Ensemble Monte-Carlo / Molecular Dynamics (EMC/MD) method [5, 6] is a particle transport simulation method based on the classical Boltzmann transport equation. In this method, carriers are treated as point charges, and their real-space trajectories under the point-to-point Coulomb potentials are calculated by the MD algorithm. Acoustic and optical phonon scattering are taken into account as stochastic changes in the carrier momentum according to the standard energy-dependent formulations based on a Fermi-golden-rule-type approach in the bulk band structure. Carrier-carrier scattering and carrier-impurity scattering are described directly by the MD algorithm. Instantaneously unscreened charges of dopants in the source/drain regions are considered. Therefore, the EMC/MD simulation is suitable for investigating the impact of random impurity distributions and discrete carriers on the device characteristics. However, the MD algorithm requires huge computational resources to calculate the Coulomb interactions between all charged particles in the system, which has prevented full device simulation of MOSFETs including source and drain regions with a large number of carriers and impurity ions.
Our research group has worked on the acceleration of the EMC/MD method [7]. We have employed MD-GRAPE [8], which is dedicated hardware for calculating point-to-point Coulomb interactions, and its emulator, Phantom-GRAPE [9], which utilizes the SIMD extension instructions of the CPU. However, the dedicated hardware approach parallelizes only the Coulomb force calculation and is not suitable for parallelizing the entire algorithm of the EMC/MD method. Furthermore, it has a high implementation cost.
In recent years, graphics processing units (GPUs) have attracted much attention owing to their large-scale parallelism and excellent cost performance. Many studies have demonstrated that MD and Monte Carlo (MC) algorithms are effectively accelerated by utilizing GPUs [10, 11]. Most PCs are equipped with a graphics board, so a GPU parallel computing environment is easily available, which clearly distinguishes this approach from conventional approaches utilizing dedicated hardware.
In this work, we demonstrate that the EMC/MD method can be effectively accelerated by GPU computing. The execution speed of the EMC/MD calculation is enhanced by a factor of 10 or more. We carried out a full device EMC/MD simulation of a Si nanowire (NW) transistor including source and drain regions within a reasonable time. The EMC/MD simulation powered by GPU computing opens the door to the statistical analysis of nanoscale transistors that considers the discreteness of impurity ions and transported carriers. This paper is organized as follows: Section 2 describes the softened Coulomb potential used to reproduce majority carriers within the EMC/MD scheme. Section 3 outlines parallel computing on GPU and the GPU architecture, and also presents the results of the accelerated n-type bulk EMC/MD simulation. In Section 4, we demonstrate a full MOSFET device EMC/MD simulation including source and drain regions. Section 5 is dedicated to conclusions.
EMC/MD simulation of majority carriers
In the EMC/MD method, carriers are treated as classical particles, and their real-space trajectories under the point-to-point Coulomb potentials are calculated by the MD algorithm. Acoustic and optical phonon scattering are described as stochastic changes in the carrier momentum according to the standard energy-dependent formulations [6].
Carriers and impurity ions are treated as point charges in the EMC/MD method, and thus the attractive Coulomb force between an electron and a donor ion diverges at the singular point of zero distance [12]. This is problematic when dealing with the majority carriers in the source and drain regions of an n-type Si MOSFET. To avoid the problem, we employed the softened Coulomb potential of Eq. (1), which depends on the elementary charge e, the permittivity of the semiconductor, the distance r between a carrier and a donor ion, and a softening parameter. The softening parameter determines the depth of the bottom of the Coulomb potential (Fig. 1).
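To illustrate how a softening parameter bounds the depth of the potential well, the following minimal sketch evaluates a Plummer-type softened potential energy, -e^2 / (4 pi eps sqrt(r^2 + a^2)). This functional form, the symbol `a`, the relative permittivity of 11.7 for Si, and all function names are assumptions made for illustration; they are not necessarily identical to Eq. (1).

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge [C]
EPS0 = 8.8541878128e-12      # vacuum permittivity [F/m]
EPS_SI = 11.7 * EPS0         # Si permittivity (relative value 11.7 is an assumption)


def softened_potential_energy(r, a):
    """Electron-donor potential energy [J] at distance r [m].

    Plummer-type softening with parameter a [m] is an assumed form,
    chosen only because it stays finite at r = 0.
    """
    return -E_CHARGE ** 2 / (4.0 * math.pi * EPS_SI * math.sqrt(r * r + a * a))


def well_depth_ev(a):
    """Depth of the potential bottom [eV], reached at r = 0."""
    return -softened_potential_energy(0.0, a) / E_CHARGE


if __name__ == "__main__":
    for a_nm in (0.5, 1.0, 2.0):  # hypothetical softening parameters
        print(f"a = {a_nm} nm -> depth = {well_depth_ev(a_nm * 1e-9):.4f} eV")
```

In this form the well depth is inversely proportional to the softening parameter, while the potential approaches the bare Coulomb potential at distances much larger than the softening parameter.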
In this study, the softening parameter was determined from the low-field mobility simulation of n-type bulk Si. All electrons and impurity ions are randomly placed inside the unit cell, to which a 3D periodic boundary condition is applied. The mobility is determined from the mean travel distance under an external field of 1 kV/cm, averaged over 100 ps. The numbers of electrons and impurity ions are both 100. The time step of the MD run is set to 0.01 fs. Figure 2 shows the calculated low-field mobility in n-type Si plotted versus the impurity concentration [13]. The softening parameter was selected to reproduce the literature data [14].
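The mobility extraction described above amounts to dividing the mean drift velocity by the applied field. The sketch below shows this arithmetic with the field (1 kV/cm) and averaging time (100 ps) taken from the text; the function name and the displacement values are hypothetical and serve only as an illustration.

```python
def low_field_mobility(displacements_cm, field_v_per_cm, duration_s):
    """Mobility [cm^2/(V s)] from per-electron travel distances along the field.

    displacements_cm: distance travelled by each electron [cm]
    field_v_per_cm:   magnitude of the external field [V/cm]
    duration_s:       averaging time [s]
    """
    mean_distance = sum(displacements_cm) / len(displacements_cm)
    drift_velocity = mean_distance / duration_s      # [cm/s]
    return drift_velocity / field_v_per_cm           # [cm^2/(V s)]


if __name__ == "__main__":
    # Hypothetical data: 100 electrons, each displaced by 1.4 um in 100 ps.
    distances = [1.4e-4] * 100
    print(low_field_mobility(distances, 1.0e3, 100e-12))
```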
The low-field mobility cannot be reproduced by a single softening parameter value over the full range of dopant concentration. Rather, the parameter must be increased at impurity concentrations of $5 \times 10^{18}$ cm$^{-3}$ or higher. This value is close to the threshold impurity concentration at which the semiconductor changes from non-degenerate to degenerate. The change in the optimal softening parameter value at higher concentrations suggests that the effect of the donor potential on a carrier is weakened significantly in degenerate semiconductors.
Parallel computing on GPU
A GPU is an auxiliary processor specialized for real-time rendering, and it is used under the control of the CPU. Since image rendering consists of repeated, independent simple operations, its efficiency has been improved through large-scale parallel computation. A GPU is designed to perform large-scale parallel operations with several thousand integrated cores. Figure 3 shows the progress in the computing performance of CPUs and GPUs [15]. In recent years, GPU architecture has been improved so that it can be applied to general processing, not only to graphics, which is called General-Purpose computing on GPU (GPGPU). The GPGPU technique enables us to easily implement parallel computation for various problems, which is a distinctive advantage over dedicated processors designed for limited purposes.
GPU architecture varies among manufacturers, and the major vendors improve their architectures every year. Here, we outline the GK110 (Kepler) architecture of NVIDIA Corporation [16]. Figure 4 shows an overview of GPU computing and the GPU architecture. Instructions and data are transmitted from the CPU to the GPU via the PCI Express interface. The instruction sequences sent to the GPU are delivered to each streaming [...]

Figure 3: The progress of the theoretical performance of CPU and GPU [15].

[...] GPU computing is to reduce the amount of transferred data and the frequency of transfer instructions between the CPU and the GPU as much as possible.
In this work, the EMC/MD algorithm is fully parallelized by utilizing a GPU. A single GPU thread is assigned to the calculations associated with one electron, so the number of threads is equal to the number of electrons. We employed an NVIDIA GeForce GTX 560 Ti and a GeForce GTX 690, and compared the execution time with that of a single-core calculation on an Intel Core i7-3930K CPU.
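The one-thread-per-electron assignment can be pictured with the following serial sketch, in which `force_on_electron` plays the role of the work done by a single GPU thread and the outer loop in `all_forces` stands for the threads that would run concurrently. It is written in plain Python for readability and is not the GPU implementation; the function names, the unit prefactor `K`, the omission of periodic images, and the choice to soften only the electron-donor term (with an assumed Plummer-type form) are illustrative assumptions.

```python
K = 1.0  # Coulomb prefactor e^2/(4*pi*eps) in arbitrary units (assumption)


def force_on_electron(i, electrons, donors, soft):
    """Work of one hypothetical GPU thread: total Coulomb force on electron i."""
    xi, yi, zi = electrons[i]
    fx = fy = fz = 0.0
    # Electron-electron repulsion (bare Coulomb force).
    for j, (xj, yj, zj) in enumerate(electrons):
        if j == i:
            continue
        dx, dy, dz = xi - xj, yi - yj, zi - zj
        inv_r3 = (dx * dx + dy * dy + dz * dz) ** -1.5
        fx += K * dx * inv_r3
        fy += K * dy * inv_r3
        fz += K * dz * inv_r3
    # Electron-donor attraction, softened so that it stays finite at r = 0.
    for xd, yd, zd in donors:
        dx, dy, dz = xi - xd, yi - yd, zi - zd
        inv_r3 = (dx * dx + dy * dy + dz * dz + soft * soft) ** -1.5
        fx -= K * dx * inv_r3
        fy -= K * dy * inv_r3
        fz -= K * dz * inv_r3
    return (fx, fy, fz)


def all_forces(electrons, donors, soft):
    """Serial stand-in for launching one thread per electron."""
    return [force_on_electron(i, electrons, donors, soft)
            for i in range(len(electrons))]


if __name__ == "__main__":
    electrons = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    donors = [(0.5, 0.5, 0.0)]
    for f in all_forces(electrons, donors, soft=0.1):
        print(f)
```

Because each electron's force sum reads only particle positions and writes only its own result, the iterations are independent, which is what makes the per-electron thread assignment natural.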
An n-type bulk model is employed in our test EMC/MD simulation. The execution time is evaluated by changing the number of electrons while keeping the total electron density constant. The execution time is defined as the time required to complete a calculation of $10^5$ steps. Figure 5 shows the speed-up rate, which is the ratio of the execution time with the CPU to that with the GPU, MDGRAPE, or Phantom-GRAPE. The GPU parallel computation is faster by a factor of 10 or more for several thousand particles. For a few to several tens of particles, GPU computing gives no speed-up because of the overhead of parallelization and the poor performance of a single GPU thread compared with that of a single CPU core. The GTX 690 performs better than the GTX 560 Ti because of the differences in GPU architecture and in the number of cores. Figure 5 also shows the speed-up rates of the EMC/MD calculations using MDGRAPE and Phantom-GRAPE. The Phantom-GRAPE data were obtained on T2K-Tsukuba at the Center for Computational Sciences, University of Tsukuba [17]. MDGRAPE is quite effective even for a small number of particles. However, GPU computing is a promising technology in terms of its ability to perform general-purpose calculations with very high cost performance.
To maximize the parallelization efficiency, it is recommended that the number of threads be set to an integral multiple of the number of CUDA cores. The parallelization efficiency is reduced when the number of threads is not an integral multiple of the number of cores. However, for smaller calculations requiring fewer than 1000 threads, the speed-up rate is not affected by whether or not the number of threads is an integral multiple of the number of CUDA cores; in this regime the computation time is limited by the memory transfer rate rather than by the performance of each CUDA core.
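The recommendation above can be expressed as a small helper that rounds a thread count up to the next integral multiple of the core count and reports the fraction of idle slots. The helper names are hypothetical, and the core count of 384 used in the example is the commonly quoted figure for the GeForce GTX 560 Ti, stated here as an assumption.

```python
def padded_thread_count(n_threads, n_cores):
    """Smallest integral multiple of n_cores that is >= n_threads."""
    waves = -(-n_threads // n_cores)   # ceiling division
    return waves * n_cores


def idle_fraction(n_threads, n_cores):
    """Fraction of core slots left idle when n_threads is not a multiple."""
    padded = padded_thread_count(n_threads, n_cores)
    return (padded - n_threads) / padded


if __name__ == "__main__":
    cores = 384  # assumed CUDA core count (GeForce GTX 560 Ti)
    for n in (1000, 1152, 3000):
        print(n, padded_thread_count(n, cores), round(idle_fraction(n, cores), 3))
```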
Device characteristics of Si NW FET
Based on the above techniques, we have performed a full MOSFET device EMC/MD simulation including source and drain regions. Figure 6 shows the gate-all-around (GAA) n-i-n Si NW simulation model. Only conduction electrons are considered as carriers. The periodic boundary condition is adopted in the longitudinal direction of the NW. A quantum confinement potential is adopted in the cross-sectional plane [4, 18], which depends [...]

Figure 5: The speed-up rate of the EMC/MD simulation employing the n-type bulk Si model [13].
[...] for 50 ps with a time step of 0.1 fs. To reproduce the field effect caused by the gate electrode, many fixed charges are placed over the gate insulator layer, as shown in Fig. 6. These fixed charges represent the charges accumulated at the metal/oxide interface, and each bears a fractional charge determined by the gate voltage and the oxide film capacitance:

$$q = \frac{C_{OX} V_G}{N} \qquad (2)$$

where $q$ is the fractional charge assigned to a fixed charge on the gate insulator, $N$ is the total number of fixed charges, $V_G$ is the gate voltage with respect to the source, and $C_{OX}$ is the capacitance of the gate oxide film,

$$C_{OX} = \frac{2 \pi \varepsilon L_G}{\ln\left(1 + T_{OX}/r\right)} \qquad (3)$$

where $L_G$ is the gate length, $\varepsilon$ is the permittivity of SiO2, $T_{OX}$ is the thickness of SiO2, and $r$ is the radius of the Si NW. To satisfy the charge neutrality condition of the whole device, the number of electrons in the source and drain regions is adjusted. A sufficient number of fixed charges is placed on the gate insulator layer to approximately reproduce the field effect caused by the gate electrode. The work function of the gate electrode is assumed to be identical to that of the Si channel region.
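As a worked illustration of how the fractional charge follows from the gate voltage, the sketch below divides the total gate charge, C_OX V_G, equally among N fixed charges. The oxide capacitance is evaluated with the standard coaxial-capacitor expression, which is assumed here to describe the GAA oxide; the relative permittivity of 3.9 for SiO2, the device dimensions, and the function names are likewise assumptions chosen only for the example.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge [C]
EPS0 = 8.8541878128e-12      # vacuum permittivity [F/m]
EPS_SIO2 = 3.9 * EPS0        # SiO2 permittivity (relative value 3.9 is an assumption)


def gaa_oxide_capacitance(l_gate, t_ox, radius, eps=EPS_SIO2):
    """Coaxial-capacitor estimate of the GAA oxide capacitance [F] (assumed form)."""
    return 2.0 * math.pi * eps * l_gate / math.log(1.0 + t_ox / radius)


def fixed_charge(v_gate, n_fixed, c_ox):
    """Charge [C] carried by each of n_fixed gate charges at gate voltage v_gate."""
    return c_ox * v_gate / n_fixed


if __name__ == "__main__":
    # Hypothetical geometry: 10 nm gate, 1 nm oxide, 2 nm NW radius.
    c_ox = gaa_oxide_capacitance(10e-9, 1e-9, 2e-9)
    q = fixed_charge(v_gate=1.0, n_fixed=100, c_ox=c_ox)
    print(f"C_OX = {c_ox:.3e} F, q = {q / E_CHARGE:.3f} e")
```

For an oxide much thinner than the NW radius, this expression reduces to the parallel-plate value, the permittivity times the NW surface area under the gate divided by the oxide thickness.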
In this study, the gate voltage is simply defined as the total charge in the gate electrode divided by the capacitance of the oxide layer. To determine the electrode voltages exactly, the number of charges should be controlled by the grand canonical ensemble MD technique. In that case, the voltage is clearly defined by the chemical potential in the electrode. In the present work, however, we employed the ad hoc definition of the gate voltage to simplify the simulation setup.

[...] is determined by Eq. (4), where $C'_{OX}$ is the oxide capacitance per unit length of the NW, and $V(x)$ is the voltage applied to the oxide film at position $x$. We assume that $V(x)$ is given by the linear function of Eq. (5), where $V_D$ is the drain voltage and $L_S$ is the length of the source region. Therefore, for a given $V_G$, the charge assigned to a fixed charge is determined by Eq. (6). By this change, the $I_D$-$V_G$ characteristics in Fig. 10 are shifted along the $V_G$ axis from those in [...]
Conclusion
We have demonstrated that the EMC/MD method is successfully accelerated by a GPU. The GPU parallel computation is faster by a factor of 10 or more for several thousand particles. By designing a softened Coulomb potential that resolves the electron trapping problem, EMC/MD simulation became possible in systems containing charges of opposite sign, such as electrons and donor ions. A full-scale, whole-device EMC/MD simulation including source and drain regions is realized by utilizing a GPU. The EMC/MD method powered by GPU computing opens the door to the statistical analysis of nanoscale device characteristics, and potentially to the atomistic simulation of elemental circuits such as DRAM and SRAM.
