A Multi-level parallel simulation approach to electron transport in nano-scale transistors by Luisier, Mathieu, Purdue University - Main Campus & Klimeck, Gerhard
Purdue University
Purdue e-Pubs
Other Nanotechnology Publications Birck Nanotechnology Center
12-15-2008
A Multi-level parallel simulation approach to
electron transport in nano-scale transistors
Mathieu Luisier Purdue University - Main Campus
Purdue University - Main Campus
Gerhard Klimeck
Purdue University - Main Campus, gekco@purdue.edu
Follow this and additional works at: http://docs.lib.purdue.edu/nanodocs
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.
Luisier, Mathieu Purdue University - Main Campus and Klimeck, Gerhard, "A Multi-level parallel simulation approach to electron
transport in nano-scale transistors" (2008). Other Nanotechnology Publications. Paper 148.
http://docs.lib.purdue.edu/nanodocs/148
1A multi-level parallel simulation approach to
electron transport in nano-scale transistors
Abstract
Physics-based simulation of electron transport in nanoelectronic devices requires the solution of thousands of
highly complex equations to obtain the output characteristics of one single input voltage. The only way to obtain a
complete set of bias points within a reasonable amount of time is the recourse to supercomputers offering several
hundreds to thousands of cores. To profit from the rapidly increasing availability of such machines we have developed
a state-of-the-art quantum mechanical transport simulator dedicated to nanodevices and working with four levels of
parallelism. Using these four levels we demonstrate that an almost ideal scaling of the walltime up to 32768 processors
with a parallel efficiency of 86% is reached in the simulation of realistically extended and gated field-effect transistors.
Obtaining the current characteristics of these devices is reduced to some hundreds of seconds instead of days on a
small cluster or months on a single CPU.
I. INTRODUCTION
The aggressive semiconductor device scaling has led to transistors containing less than ten thousand atoms in the
active region. This concerns for example the silicon-on-insulator ultra-thin-body (UTB) and multi-gate nanowire
(NW) field-effect transistors (FETs) currently investigated in research labs and universities[1], [2], [3]. According
to the International Road Map for Semiconductors [4] these devices will run into production in the middle of next
decade. The real challenge is to make these devices operate not “in spite of” but “because of” the quantum effects
inherent to the nano world.
Technology computer aided design (TCAD) can accompany this effort if classical concepts such as drift-diffusion
models are replaced by quantum-mechanical approaches capturing the entire bandstructure of a crystal, the atomic
granularity of the simulation domain, and the non-equilibrium condition of the problem. The semi-empirical tight-
binding method fulfills these requirements and therefore has become more and more popular among the TCAD
community[5]. In tight-binding each atom as well as the connections between them are taken into account and
characterized by a matrix whose size reflects the complexity of the bandstructure model. We opted for the nearest-
neighbor  	 
 tight-binding model where ten (without) or twenty (with spin-orbit coupling) orbitals per atom
are retained[6]. The model properly represents the behavior of the constituent baseline materials for state-of-the-
art semiconductors such as Si, Ge, GaAs, InAs, AlAs with respect to all critical bandgaps, masses, and stress
deformations.
April 14, 2008 DRAFT
2Since a realistic two-dimensional device contains at least five thousand atoms the        
 tight-binding model
leads to sparse linear systems of fifty thousand of unknowns or more that have to be solved for each injection
momentum (typically 10 to 20) and energy (from 500 hundred to 2000 points). This procedure is repeated for each
input voltage and eventually for several different transistor structures. Assuming that 10 seconds are needed to solve
one momentum-energy combination, 14 hours is the lowest limit to treat one bias point, without even considering
the self-consistency of the electrostatic potential and carrier density. Until recently it was not clear whether a multi-
dimensional and quantum-mechanical simulation of nanodevices was possible without neglecting some important
characteristics like the full bandstructure (effective mass approximation) or the unity of the simulation domain
(mode space approximation)[7], [8].
Our quantum transport simulator, OMEN, addresses these issues. It solves two- and three-dimensional Schro¨dinger
equations in the        
 atomistic basis[9]. Special attention has been paid to the calculation of the open-boundary
conditions (OBCs) which may severely limit the size of the simulation domain if it is not optimized. The usual
iterative solutions[10] and generalized eigenvalue problems[11] have been abandoned in favor of a shift-and-invert
procedure[12]. The Non-equilibrium Green’s Function (NEGF)[13] and the Wave Function (WF)[14] formalisms
have been implemented to solve the transport problem and to obtain density-of-states, transmission coefficients,
carrier, and current densities. To get insight into the ballistic properties of a device where no incoherent scattering
is considered the WF formalism is more suited than NEGF since it requires less operations[15].
The large supercomputer resources that are nowadays available have led us to incorporate four levels of parallelism
into OMEN and to specifically design it for distributed memory machines. The highest level consists in distributing
the input voltages among different processor sub-groups. This is an embarrassingly parallel workload since each
bias point is independent from the others. The next two levels deal with the injection momentum and energy. Each
configuration of these two quantities results in a large sparse linear system whose solution has to be integrated
once all the points are calculated. Inter-processor communication is only required at the final summation stage. The
fourth level of parallelization concerns spatial domain decomposition. Depending on the dimensions of the device,
it is not always possible to solve the sparse linear systems on a single core. In this case the simulation domain is
distributed on many processors which have to exchange information frequently.
The parallelization of momentum, energy, and spatial domain decomposition alone scales almost perfectly to
4096 cores. For a reasonable number of 50-200 bias points in transistor performance curves (output  and
transfer  characteristics) OMEN would scale to at least 200’000 cores in quad-level parallelism. Here,
we demonstrate the capability of the quad-level parallelism on Ranger the first NSF Track 2 system[16], [17] up
to 32768 processors. This unique performance proves that supercomputers will definitively build the core of next
generation device simulators.
The paper is organized as follows. In Section II the physical models that govern OMEN are reviewed and the
principal equations to solve are highlighted. The implementation of the simulator is described in Section III and
its computational performance is presented in Section IV. Finally, the paper is summarized in Section V and an
overview of further improvement is given.





















Fig. 1. Schematic view of a gate-all-around nanowire (left) and double-gate ultra-thin-body (right) field-effect transistor. The transport direction
 is aligned with the ff 100 fi crystal axis here, fl is a direction of confinement, while ffi is a direction of confinement for the NW structure and
is assumed periodic for the UTB. The source, the drain, and the gate regions have a length ! , " , and # , respectively. Red dots represent
atoms, the yellow layers surrounding the atomic channel are oxide layers represented as continuous dielectrics.
II. PHYSICAL MODELS
OMEN is a three-dimensional (3D), full-band, and atomistic simulator designed for multi-gate devices as illus-
trated in Fig. 1. For convenience the transport direction is assumed to be aligned with the $ axis, % is a direction
of confinement, and the confinement or periodicity of & depends on the device type. For nanowires & is a second
direction of confinement. In 2D ultra-thin-body transistors & is periodic so that the value of any function ')(*$!+%+,&.-
representing the internal states of the device obeys the following rule
')(/$0+%+1&2435-76 89;:<(*=?>A@B3C-ED	')(*$!+%+,&.-1+ (1)
where >A@ is a parameter valued between - F and F . The induced >.@ -dependence of the internal variables is integrated
to obtain carrier and current densities. Without periodicity in & , >;@ is equal to 0, the simulation domain is closed
along this direction, and no structural code modification is needed.
To solve the Schro¨dinger equation in the WF or NEGF formalism, electrons with a momentum >;@ and an energy
G
are injected and collected at the source and drain contacts of the device. There is no injection from the gate






















In Eq. (2) the Schro¨dinger equation is expanded in the nearest-neighbor        
 tight-binding orbital basis so that the
vector M J (X>S@A+
G




- associated with an orbital \I]Q^   , _ ,` , @ ,   
 , _	` , ` @ , A_ @ ,









belonging to an atom situated at the position (d%+1&- inside a slab = . Here, a slab denotes the smallest
group of atomic planes (atoms with the same $ coordinate) so that the block tri-diagonal form of Eq. (2) is ensured.
The atomic topology in each slab and the number of layers is determined by the crystal orientation of the transport
direction. For example when $ is aligned with the e 100 f , e 110 f , or e 111 f crystal axis, a slab contains one
single atomic plane, but for $ = e 112 f a slab is composed of four atomic planes. If there are gih atoms in the
slab = the size of the vector MjJ,(
G




pH5JKJ represent the on-site energy and the bond connections for all the atoms in the
slab = , while the H JRJrqQP represent the coupling to the nearest-neighbor slabs. Equation (2) has to be solved for
each slab index = . The matrices H P?s and H)tNt OQP couple the device to its semi-infinite contacts represented by the
self-energies u PP and uvtNt that are added to H PP and H)tNt , respectively. Different approaches exist to calculate
the open-boundary conditions (OBCs). We will only mention and briefly describe three of them
w iterative solution[10]: an equation of type ux6zyE({uj- is iteratively solved until convergence is achieved. Full
matrices have to be inverted.
w generalized eigenvalue problem (GEVP)[11]: given a momentum >.@ and an energy
G
the momentum of the
injected states > _ is calculated using a generalized eigenvalue problem |~}6} based on the Bloch
condition that reigns in the semi-infinite contacts.
w shift-and-invert procedure[12]: starting from the same postulate as for the GEVP the matrix A is shifted and
inverted (*|C- T!P }6Ł;(/U-?} . Due to its particular structure, the inversion of (*|C- is straight-
forward and the eigenvalues of only a small block of (/|pIC- T0P  need to be computed.
The shift-and-invert approach is about ten times faster than the two other methods[15] and is therefore always used
in OMEN for the calculation of the open-boundary conditions in UTB and NW FETs.
After the computation of u PP and ujtjt , a matrix containing Eq. (2) for each slab = is assembled and solved








where the multiple right-hand-side vector (X> @ +
G
- includes the C (  ) states injected from the left (right)












where  is the identity matrix[12]. If there are 5 atoms in the nanowire, the Hamiltonian  is a block-tridiagonal
matrix (with sparse blocks) of size 5loniklon as the boundary self-energy u . The first and last diagonal blocks
of this matrix are full and contain u P,P and uvtNt , respectively. The matrix 	.(X>A@S+
G
- is of size   l n kC(/  2  - .
Because the symmetry properties of the boundary self-energy u and of the Hamiltonian  are different the linear
system of equations above is complex, non-symmetric, and non-hermitian. OMEN is able to solve both Eq. (3) and
April 14, 2008 DRAFT
5(4), but for ballistic transport (no incoherent scattering) the Wave Function approach is much more efficient and is
preferred to NEGF[9], [15].
Equation (3) is evaluated for all the ( > @ ,
G





are used to determine the electron g (d¡¢- and hole  (d¡¢- density as well as the current S(d¡¢- . The momentum and


























The definitions of the functions y ¨ (d¡[> @ +
G
- , y	©(d¡[> @ +
G
- , and y ª (*¡[1> @ +
G
- can be found in Ref. [9].
Finally, the charge density is self-consistently coupled to the electrostatic potential of the device. For that purpose
the 2D and 3D Poisson equations are solved on a finite element grid and have the following form
«
D	¬ 6 ­i+ (6)
where
«
is the stiffness matrix of the domain, ¬ the electrostatic potential at each node position, and ­ a source
term[18]. The potential ¬ is not directly calculated from Eq. (6), but by applying a Newton scheme to take care of











The size of the linear system in Eq. (8) is much larger than that of Eq. (3) and (4) because (1) additional
discretization points are interleaved between the atoms and (2) the oxide layers are not taken into account in
the transport calculation, but in the Poisson equation. The flow chart of OMEN for the computation of one bias
point is summarized in Fig. 2.
III. SIMULATOR IMPLEMENTATION
The quantum-mechanical simulator OMEN is written in C++ and is specifically designed for distributed memory
computers using MPI[20] as the communication protocol. All algebra operations involving vectors and full matrices
are performed with highly-optimized Level 1, 2, and 3 BLAS functions like (d,z)axpy, (d,z)gemm, or (d,z)gemv[21].
In its task flow depicted in Fig. 2 four loops can be identified, over the bias points (dashed blue), over the self-
consistent calculation of the electrostatic potential (solid orange), over the momentum (dashed-dotted green), and
over the energy (red with circles). The core of the simulator is the computation of the open-boundary conditions as
described in Section II and the solution of Eq. (3) that are common to all the loops. In a typical simulation where the
transfer characteristics of a device is requested, for a given drain-to-source voltage ! the gate-to-source voltage
 is ramped from 0.0 V to the power supply voltage 0² ² in 10 to 20 steps. For each ( U+³ ) the calculation of























Fig. 2. Flow chart of OMEN. For a given bias point
(dashed blue loop) Eq. (3) has to be solved for each
( ´	µ , ¶ ) configuration (solid red loop with circles and
dashed-dotted green loop) and the carrier density and the
electrostatic potential ·v¸ .¹ fl ¹ ffi	º are updated by solving
Poisson equation till self-consistency (SC, solid orange
loop) is reached.
kz1














Fig. 3. Parallel scheme of OMEN. In Fig. 2, the dashed blue (bias points), the dashed-
dotted green (momentum), and the red with circle (energy) loops are parallelized and
form the three highest levels of parallelism. From the total number of cores »½¼ there
are » ¾¿ cores operating per bias point, » À
µ
per momentum point, and » Á per energy
point. If Â Á fi 1 (here 4) the simulation domain is spatially decomposed (fourth level
of parallelism).
a self-consistent electrostatic potential takes 3 to 10 iterations. The momentum and energy dependence require 10
to 20 and 500 to 2000 points, respectively. Considering the lowest limit case the boundary conditions and Eq. (3)
have to be solved 150’000 times (10 k 10 k 500 k 3). A process time of 10 seconds per iteration corresponds to a
total simulation time of 17 days on a single computer.
Obviously, to reduce this time the loops over the bias points, the momentum, and the energy can be parallelized.
In effect the calculation at   P does not depend on the calculation at  Ã . Similarly, the solution of Eq. (3) for > @ P
and
G
P is independent from the solution for > @1Ã and
G
Ã , but the results have to be stored to evaluate Eq. (5). The
loop over the self-consistent calculation cannot be parallelized since the current electrostatic potential is obtained
from the previous iteration and no simultaneous distribution of the tasks is possible. As an additional fourth
parallelization level the simulation domain can be spatially decomposed over different processors. The resulting
simulator capabilities are shown in Fig. 3. It is a direct translation of the flow chart in Fig. 2 into a multi-level
parallel hierarchy.
If a device simulation is launched on Äs cores the distribution of the tasks happens as follows
w each of the 5Å
¿
bias points defined by the user is treated independently and has Ä5Å
¿
(user input) cores that




= Æ s . During the initialization phase groups of communicators are built using
April 14, 2008 DRAFT
7the “MPI Group incl” and “MPI Comm create” commands. Two groups È¨ and ÈÉ will never exchange any
information during the simulation time and run in an asynchronous way (embarrassingly parallel workload).
Note that the number of bias points per group is not limited to a single one.
w the ÆiÅ ¿ cores dedicated to one or several bias points are subdivided into )Ê@ groups of ÄÊË cores with
 Ê ËkÌÆ Ê Ë = Æ ÅU¿ . The number of > @ points is selected by the user to capture all the features of the contacts
bandstructure[12]. Currently, the number of processors per > @ point Æ Ê Ë is allocated by the user and not
dynamically. This will be improved in the future to avoid that all the groups have to wait till the last one has
completed its task. Typically, the amount of work per group has a standard deviation of 15% from its mean
value.
Each member of a group of ÆiÊ@ processors shares the same electrostatic potential, carrier density, and source,
gate, and drain voltages. The integration over >.@ in Eq. (5) is synchronous and accomplished once the )Ê@
groups have finished their job by calling the “MPI Allreduce” function with “MPI SUM” as argument. Three-
dimensional structures, like nanowires, needs a specific treatment. The absence of periodic boundary conditions
along the & -axis is accounted by setting  Ê@ =1 and > @ =0. Hence, the maximum number of parallelization levels
in OMEN is reduced to three for 3D nanowires FETs.
w each group of Æ Ê Ë cores has to compute CÍ energy points. However, the energy grid is an adaptively refined
mesh whose number of elements  Í depends on the value of >@ . Each group member in this parallel level
has a common phase 89;:<(d=o>@U35- and solves g!Î =  Í / ÆLÊË energy points if the spatial domain decomposition
is switched off and g!Î =  Í Ł¢ÆLÊËk4Æ Í points if it is on. The parameter Ä Í is the number of cores that
simultaneously work on the same boundary conditions and on the same Eq. (3).
A straight-forward way to distribute the work load is to assign the gLÎ first energy points to the first processor,
the g!Î next ones to the second processor, and so on. This is an inefficient distribution as illustrated in Fig. 4.
In effect the number of propagating states that are injected into the device from the left  and the right C
contacts varies with energy. The time to factorize the matrix in Eq. (3) remains the same, but because the
number of right-hand-sides 552Ï increases as function of the energy the time to solve the linear system
also increases. Therefore, the group dealing with the lowest energies is completed much before the group
dedicated to the highest energies. For that reason the distribution mechanism has been improved to make the










Ã¨SÓ , and so on. Due to the continuity of the







similar and the work load is leveled better. Each processor in the group does a local energy integration of
Eq. (5) over the g<Î points it computes.
w if the size of the matrix ( Ç(/> @ -;u(/> @ +
G
- ) in Eq. (3) or Eq. (4) is too large, if the memory per processor
is too low, or if for a given Æ Å	¿ Æ Ê Ë = Æ ÅU¿ Ł Ê Ë is comprised between 5Í½ŁBÔ and CÍ (see Section IV) it is
convenient to assign ÆEÍ processors to a single energy point and to apply a spatial domain decomposition
to the device. Then, the advantage of Eq. (3) over Eq. (4) becomes clear: there are several parallel direct
sparse linear packages that can handle Eq. (3) like PARDISO[22] (shared memory), SuperLU  J 
Ñ
2.0[23],
April 14, 2008 DRAFT
8MUMPS 4.7.3[24], or a recently developed basis compression algorithm[25] (all distributed memory) while
the NEGF problem in Eq. (4) does not lend itself to efficient parallelization[26], [27]. Extensive studies[9],
[15], [25], [28] including also the sequential solver Umfpack 5.0.1[29] have shown that for ballistic transport
(1) even on a single processor the solution of the NEGF problem is about 5-10 times slower than the fastest
solver, (2) the basis compression algorithm is the most efficient solver in terms of performance on a single
processor and parallel scalability, but only for 3D structures and under certain transport directions ( $ = e 100 f
and $ = e 111 f ), and (3) in general MUMPS and Umfpack are not as sensitive as SuperLU and PARDISO to the
transport direction. Other solvers like Spike[30] are currently under investigation. It is worth noting that what
makes Eq. (3) computationally intensive is not really the size of the linear system, but its bandwidth. A matrix
with a size  =2’000’000 and a bandwidth ÕÖe 1000 is easily manageable while a matrix with  =500’000 and
Õ =4000 as produced by 3D nanowires with a 5 k 5 nm Ã cross section represents a real challenge.
Another important issue of the spatial domain decomposition is the calculation of the open-boundary conditions.
As mentioned previously, two contacts are taken into account, the source and the drain and an eigenvalue prob-
lem is solved for each of them by calling the LAPACK[31] functions dgeev and zgeev. Since ScaLAPACK[32]
does not provide a parallel version of these two functions (pdgeev and pzgeev) the only way to parallelize the
calculation of the OBCs is to attribute the source contact to one processor and the drain to another. Only two
processors can be used in the OBCs calculation, others allocated to the domain decomposition will remain
idle during this phase and the parallel efficiency is reduced.
The loop over the self-consistent calculation of the electrostatic potential cannot be parallelized, but Eq. (8) can
and must. The momentum, energy, and domain decomposition parallelization can reduce the time to compute charge
and current densities in Eq. (5) to ×5(/ Ã - seconds. This is the order of magnitude of the Poisson equation if a
sequential and direct sparse linear solver is used. To prevent OMEN from this problem Eq. (8) is solved with a
parallel iterative solver, Aztec[33]. Not all the Æ½Å
¿
processors assigned to a bias point are concerned, but a number
comprised between 1 and 64 and defined by the user.
IV. RESULTS
The performance of OMEN is analyzed for the simulation of two realistic devices run on Ranger the first NSF
Track 2 Supercomputer located at the Texas Advanced Computer Center[16]. Each compute node contains four
AMD Opteron Quad-Core 64-bit processors (16 cores in all, core frequency of 2.0 GHz) on a single board and
32GB of memory, i. e. approximatively 2GB per core if the 16 cores per node are used. The transfer characteristics
Ø at a fixed drain-to-source voltage is computed. This is a very important quantity for circuit designers.
According to the ITRS[4] transistors are specified among other properties by their subthreshold swing Ù , their ON-
and OFF-current Ú t and ÚLÛ!Û , their threshold voltage 
Ñ*Ò
, and their channel mobility. All can be extracted from
the simulated     characteristics as illustrated in Fig. 5 for a UTB structure.
A device simulator is evaluated with respect to two criteria, the accuracy of the physical models, and the time
required to obtain observable data, like the transfer characteristics. The physical models have been treated in
April 14, 2008 DRAFT

























≥ 2 and N
R
≥  2
Fig. 4. Transmission coefficient through an UTB field-effect transistor ( · "
 
=1.0 V, ·# =0.4 V) as function of the energy. A transmission
Ü
¸r¶½º = Ý means that at least Ý states are injected from the left and the right contacts, i. e. ÞNß5àÝ and Þ áàÝ . A transmission equal to
0 reveals that states are injected only from one contact and no state is available in the other contact. Hence, the size of âãä¸K´¢µ ¹ ¶½º in Eq. (3)
is energy-dependent.
Section II and recognized by several publications[9], [12], [15], [34], [35], [36], [37]. The computational efficiency
is discussed in this Section.
A. Gate-All-Around Nanowire Field-Effect Transistor
The first benchmark structure is a silicon gate-all-around (GAA) circular nanowire field-effect transistor as
depicted in the left part of Fig. 1. Such ultrascaled devices have already been demonstrated by research groups[3].
The Si channel has a diameter of  =3 nm and is surrounded by oxide layers of thickness l1å _ =1 nm. The source
and the drain extensions measure æv = æ½ =10 nm, the gate æ½ =15 nm. Since the & -axis is a direction of confinement
no periodic boundary conditions are applied to the structure, the number of >;@ points ÊË is equal to 1, and only
three levels of parallelism are possible (bias, energy, and spatial domain decomposition). In other words the number
of processors per > @ point is always equal to the number of processors per bias point Æ Ê Ë = Æ Å	¿ . Furthermore, the
number of self-consistent Poisson iterations is limited to 2 in this timing experiment.
Figure 6 shows the transfer characteristics of the GAA NW transistor (inset, red curve) and the simulation time
on Æ s cores comprised between 16 and 4096 (blue and green curves). The gate voltage is ramped from 0.0 V to 0.75
V in steps of 0.05 V and 16 bias points are simulated. The energy domain counts about 540 points. The calculation
of Eq. (3) is accomplished with the basis compression algorithm[25] and requires more than 2GB of memory for
high-energy points. Hence, spatial domain decomposition on Æ Í =2 (blue curve with squares) and ÆiÍ =4 (green
curve with triangles) CPUs becomes necessary. From 16 to 256 cores Æ Å	¿ is equal to ÆQs (one bias point treated at
the same time) then, from 512 to 4096 cores Æ ÅU¿ remains equal to 256 and ÆLs / Æ ÅU¿ bias points are simultaneously
computed. The simulation times and relative speed-up factors measured with respect to Æ s =16 and Æ Í =2 are also
reported in Fig. 7.
April 14, 2008 DRAFT
10












































Fig. 5. Transfer characteristics ç "jè ·B# ( · "
 
=1.0 V, blue line with circles) of a Si UTB FET as schematized in Fig. 1 ( é*ê/ë "ì =4.9 nm,
 = " =10 nm, # =22 nm,  = ff 100 fi , fl =(100)). Important design parameters like subthreshold swing í , ON-current çîï , OFF-current
ç
îðð , threshold voltage ·¢ñrò , or mobility upper limit óAô³õRö can be extracted from the simulation data.
















Spatial Decomp. on 2 CPU
Spatial Decomp. on 4 CPU













Fig. 6. OMEN simulation time for the transfer characteristics ç "vè
·B#, ( ·B"
 
=0.6 V, red curve with circles in inset) of a GAA NW field-
effect transistor (diameter ÷ =3 nm, source and drain length   =  " =10
nm, gate length # =15 nm). Three levels of parallelism are used: bias
points (16 points here), energy points ( ø 540 points), and spatial domain
decomposition ( Â;Á =2, blue curve with squares and ÂÁ =4, green curve




Â;Á =2 (s) Speed-Up ÂÁ =4 (s) Speed-Up
16 64373 1x/1x 85070 0.76x/1x
32 33071 1.9x/2x 43055 1.5x/2x
64 17100 3.8x/4x 21827 2.9x/4x
128 9103 7.1x/8x 11205 5.7x/8x
256 5028 13x/16x 5916 11x/16x
512 2531 25x/32x 2968 22x/32x
1024 1241 52x/64x 1470 44x/64x
2048 631 102x/128x 741 87x/128x
4096 323 199x/256x 378 170x/256x
Fig. 7. Parallel execution time (in seconds) and relative speed-up factor
corresponding to the results shown in Fig. 6. The reference for the
relative speed-up factor is the simulation time at Â
¼
=16 and ÂÁ =2.
The first number in the speed-up columns is the measured speed-up, the
second number the expected ideal value. The first column of the table
refers to the total number of CPU used in the simulation ( Â ¼ ), columns 2
and 3 to ÂÁ =2, columns 4 and 5 to ÂÁ =4. The number of CPU per bias
point Â¾¿ is first increased from 16 to 256 and then remains constant.
April 14, 2008 DRAFT
11
Transport Problem Poisson Equation
Â;Á Þ ÞÞúù û % OBCs (s) Eq. (3) (s) Â Þ ÞÞúù Eq. (8) (s)
1 out of memory 16 76702 1570556 0.75
2 123520 4493122 1070 3.4 12.3 15.6-18.4 32 76702 1570556 0.4
4 123520 4493122 1070 3.4 9.4 8.3-10.3 64 76702 1570556 0.2
Fig. 8. Time to compute the open-boundary conditions and Eq. (3) (transport problem, left part of the table) and to solve Eq. (8) (Poisson
equation, right part of the table) for the NW structures whose results are shown in Fig. 6 and 7. The basis compression algorithm[25] is employed
for Eq. (3) in column 7, where the first number indicates the factorization and solve time for low-energy points and the second number for
high-energy points. The variable Þ = Þýü0éþê is the size of the matrix ¸Kß è Ïè º , û its bandwidth, ÞÞúù the number of non-zero elements,
and % the sparsity of the band. The same convention applies to the data of Eq. (8) where Þ is the size of the Jacobian matrix  and Â the
number of processors invoked by the iterative linear solver Aztec[33] .
Ideally the blue and green curves in Fig. 6 would lie on top of each other since the time to compute the open-
boundary conditions and to solve Eq. (3) should be two times faster with Æ½Í =4 than with ÆEÍ =2. This is not the case
as can be seen in Fig. 8, columns 6 and 7. The impossibility of parallelizing the boundary conditions on more than
two cores and the non-ideal parallel efficiency of the sparse linear solvers explain this behavior. Therefore, when
Æ
Í =4, two processors out of four are idle during the OBCs calculation phase. Hence, on Ranger where 16 cores
per nodes are available only half of them will work and their performance increases as shown in Fig. 8 column 6.
The scaling properties of OMEN for this realistic NW structure are very close to the ideal case. The central
part of the simulator, the OBCs calculations and the solution of Eq. (3) must be performed 17280 times with an
average duration of 29 seconds ( Æ Í =2). This represents about 6 days on 2 cores. Note that the time to solve Poisson
equation is negligible (see Fig. 8 for more details). Now the same simulation that lasts about 18 hours on 16 cores
is performed in 5 minutes on 4096 cores yielding a relative speed-up factor of 199 k (256 k expected) in going
from 16 to 4096 processors. To obtain a better speed-up factor less processors per >@ point ÆEÊË should be used. In
effect with ÆEÊË =256, Æ Í =2, and  Í 540 few cores handle five energy points while the majority deals with four
energy points. If  s is the time for one energy point (assumed equal for all of them) the total time becomes 5 k s .
With ÆEÊË =128 the total time is 9 k s . The speed-up factor in going from ÆiÊË =128 to ÆEÊË =256 is only 9/5=1.8
instead of 2.
We note here that some devices that are subject to atomic disorder such as alloy compounds or interface roughness
may require a significantly larger number of energy points (  Í	 1500). In those cases good scalability to even
larger core counts can be achieved on the energy parallel level.
B. Double-Gate Ultra-Thin-Body Field-Effect Transistor
The second example is a silicon double-gate ultra-thin-body field-effect transistor designed according to the ITRS
specifications for the 22 nm technology node. A schematic view of the device structure can be found on the right
part of Fig. 1. The Si body has a thickness ln å ` =4.9 nm, the oxide layers surrounding it l å _ =1.3 nm. The length
of the source and drain contacts is æj = æý =10 nm, of the gate æ½ =22 nm. The & -axis of the structure is assumed
periodic and is modeled by 5ÊË =16 momentum points treated simultaneously. The energy grid is composed of
April 14, 2008 DRAFT
12
















1 Bias Point, 1 Poisson Iteration
8 Bias Points, 2 Poisson Iterations













8 Bias Points, 2 Poisson Iterations
16 Bias Points, 3 Poisson Iterations
Ideal Scaling
1 Bias Point, 1 Poisson Iteration
Fig. 9. Scaling example of OMEN for the simulation of the transfer characteristics ç" è ·# ( ·"
 
=1.0 V) of a UTB field-effect transistor
( éþê/ë "ì =4.9 nm,   =  " =10 nm,  # =22 nm,  = ff 100 fi , fl =(100)). The four levels of parallelism are taken into account, ÞNÀ,Ë =16 momentum
points and Þ Áø 1400 energy points are computed per bias point. Three cases are investigated (1) 1 bias point with 1 Poisson iteration, Â À Ë is
ramped from 1 to 256, Â¾ ¿ = Þ À,Ë
Â;À,Ë ( Â;Á =1 blue curve with squares, ÂÁ =2 green curve with triangles), (2) 8 bias points with 2 Poisson
iterations, ÂÀ,Ë is ramped from 8 to 128 and then remains at that value, Â¾¿ = Þ À1Ë
jÂ;À,Ë ( Â;Á =1 red curve with circles, Â;Á =2 cyan curve with
stars), and (3) 16 bias points with 3 Poisson iterations, 16  Â À Ë 128, Â ¾ ¿ = Þ À Ë
úÂ À Ë , and Â Á =2 (violet curve with cross).
1 Bias Point / 1 Poisson Iteration 8 Bias Points / 2 Poisson Iterations 16 BP/ 3 PI
Â
¼
Â;Á =1 s. Speed-Up ÂÁ =2 s. Speed-Up Â;Á =1 s. Speed-Up ÂÁ =2 s. Speed-Up ÂÁ =2 s. Speed-Up
16 17798 1x/1x - - - - - - - -
32 8982 2x/2x - - - - - - - -
64 4522 3.9x/4x - - - - - - - -
128 2277 7.8x/8x 3010 5.9x/8x 39088 1x/1x 50324 0.8x/1x - -
256 1141 15.6x/16x 1511 12x/16x 19544 2x/2x 25157 1.6x/2x 75657 1x/1x
512 583 30.5x/32x 760 23x/32x 9868 4x/4x 12672 3.1x/4x 38314 1.97x/2x
1024 291 61x/64x 385 46x/64x 5023 7.8x/8x 6383 6.1x/8x 19176 3.94x/4x
2048 150 119x/128x 198 90x/128x 2630 15x/16x 3251 12x/16x 9813 7.7x/8x
4096 85 210x/256x 100 178x/256x 1312 30x32x 1626 24x/32x 5011 15x/16x
8192 - - - - 651 60x/64x 809 48x/64x 2442 31x/32x
16384 - - - - 320 122x/128x 402 97x/128x 1281 59x/64x
32768 - - - - - - - - 687 110x/128x
Fig. 10. Parallel execution time (in seconds) and relative speed-up factor for the examples shown in Fig. 9. The relative speed-up factor refers
to the simulation time obtained with Â
¼
=16 and Â;Á =1 in the first test, Â
¼
=128 and Â;Á =1 in the second experiment, and Â
¼
=256 and ÂÁ =2
in the last test. The first entry in the speed-up columns is the measured value, the second entry the expected ideal value.
Transport Problem Poisson Equation
Â;Á Þ ÞÞúù û % OBCs (s) Eq. (3) (s) Â Þ ÞÞù Eq. (8) (s)
1 55440 1971692 380 9.3 2.7 7.6-11.8 16 83536 1690726 1.2
2 55440 1971692 380 9.3 1.35 4.5-8.4 32 83536 1690726 0.66
4 55440 1971692 380 9.3 1.31 3.8-6.0 64 83536 1690726 0.35
Fig. 11. Same as Fig. 8, but for the UTB structure in Fig. 9 and 10. Eq. (3) is solved with MUMPS[24].
April 14, 2008 DRAFT
13
a maximum of about 1400 points and an average of 1160 points per momentum. Equation (3) is solved with
the parallel sparse linear solver MUMPS[24] enabling spatial domain decomposition. Therefore, the four levels of
parallelism of OMEN are potentially applicable.
Three different tests are run and the results are shown in Fig. 9 and summarized in Fig. 10. In the first test
only one bias point is computed (   =0.0 V and   =1.0 V, three levels of parallelism) on 16 to 4096 cores with
a single Poisson iteration, ÆiÍ =1 (blue curve with squares), and ÆiÍ =2 (green curve with triangles). The number
of cores per > @ point Æ Ê Ë is increased from 1 to 256, Æ ÅU¿ =  Ê ËCk Æ Ê Ë from 16 to 4096. Almost perfect scaling
is observed on the interior three levels of parallelism with a parallel efficiency of 210/256=82%. A full sweep of
(  ,  ) would require 50-200 independent voltage points driven by the outer voltage parallelism without loss of
parallel efficiency. In the two remaining experiments we will only let the gate voltage   vary and consider 8 and
16 points up to 16384 and 32768 processors, respectively. We conclude therefore that OMEN could scale to over
200’000 cores if such a machine were available.
In the second case the gate voltage is ramped from 0.0 V to 1.0 in 8 points, all the levels of parallelism are turned
on, and 128 to 16384 cores are used, once with Æ½Í =1 (red curve with circles) and once with ÆiÍ =2 (cyan curve
with stars). The parameter Æ Ê Ë is equal to 8 for ÆLs =128, rises up to 128 for ÆLs =2048, and does not vary any more
for Æ s 2048. Finally, during the third test 16 gate voltages are computed between 0.0 V and 1.0 V, the number
of processors Æ s increases from 256 to 32768, the simulation domain is spatially decomposed on 2 CPUs ( Æ Í =2),
and 3 Poisson iterations are taken into account (violet curve with cross). The resulting transfer characteristics is
shown in Fig. 5 to illustrate the relevant design parameters.
The open-boundary conditions and the sparse linear system in Eq. (3) have to be solved 22’400 times for the first
test, 358’400 for the second, and 1’075’200 for the last one. Each iteration takes approximatively 12.5 seconds if
ÆEÍ =1 (no spatial domain decomposition), 8 seconds if Æ½Í =2, and 6 seconds if ÆEÍ =4 as reported in Fig. 11. On a
single processor the third test would last more than five months. No one in practical device simulation can wait such
a long time to obtain one curve. The fabrication and the characterization of the device would even take less time.
However, on Ranger and using 32768 cores, the five months are reduced to 11 minutes. This is possible because of
the quasi-ideal scaling performances of OMEN. For example in Fig. 10 one sees that the relative speed-up factor
obtained in going from Æ s =256 to Æ s =32768 cores is equal to 110 (128 expected) which corresponds to a parallel
efficiency of 86%.
C. Floating Point Operation Count
Another important measure of computing’s performance is the floating point operations per second (FLOP/s).
The CRAY high performance computers offer a tool called craypat to evaluate the FLOP/s in application codes.
Using the machine number 84 in the Top500 list of supercomputers[38] a rate of about 2.1 GFLOP/s on a single
core is obtained to calculate the open-boundary conditions and to solve Eq. (3). The utilized processors are AMD
Opteron with a core frequency of 2.6 GHz, 2 floating point operations per cycle, and a peak performance of 5.2
GFLOP/s. Hence, OMEN runs at about 40% of the machine peak performance. Assuming that our simulator keeps
April 14, 2008 DRAFT
14
the same FLOP/s rate per core on Ranger the simulation with 32768 cores reaches a peak of 69 TFLOP/s. With
the introduction of dual pipelines in the AMD Barcelona architecture the peak performance of the 2GHz cores on
Ranger amounts to 8GFLOP/s per core so that we can even expect a rate higher than 69 TFLOP/s.
V. CONCLUSION AND OUTLOOK
We have presented an atomistic quantum-mechanical nanodevice simulator for 2D and 3D structures like UTB and
NW field-effect transistor. An almost ideal scaling of the computational time up to 32768 cores with 86% efficiency
has been achieved and has allowed the simulation of realistically extended devices in a couple of minutes instead
of weeks. This performance opens now a wide range of possible applications. For instance the optimal design
of a transistor can be studied by simulating various structures with different dimensions, cross sections, crystal
orientations, and doping concentrations. Furthermore, random effects like interface roughness, alloy disorder, or
doping position that require a large set of results to be significant can be investigated in a reasonable amount of
time. This is the benefit of the four levels of parallelism.
To improve the parallelization of the open boundary-conditions calculation it is considered to compute them for
all the energy points first, store them on the machine scratch, and reload them prior to solve Eq. (3). Hence, no
processor remains idle in the case of spatial domain decomposition. Another issue is the efficiency of the parallel
linear solvers. For the moment only direct solvers have been tested that proceed to a LU decomposition of the
system of equations. As an alternative iterative solvers could be used[39]. They rely on matrix-vector multiplications
that can be easily parallelized, but they are not well-adapted to multiple right-hand-side problems. Finally, an hybrid
implementation of OMEN combining distributed memory and shared memory parallelization is under investigation.
The relevance of such a model comes from the architecture of the modern supercomputers containing many cores
per node. It is computationally very intensive to start MPI tasks on 32’000 cores. It could be more efficient to start
only one MPI task per node and launch threads on the remaining cores.
VI. ACKNOWLEDGEMENT
We would like to thank the Texas Advanced Computer Center staff for helping us improving our code and for
providing us a large access to Ranger, their NSF Track 2 system. This work was partially supported by NSF grant
EEC-0228390 that funds the Network for Computational Nanotechnology, by NSF PetaApps grant number 0749140,
and by NSF through TeraGrid resources provided by TACC.
REFERENCES
[1] B. Doris et al., “Extreme scaling with ultra-thin Si channel MOSFETs”, IEDM Tech. Dig. 2002, 267-270 (2002).
[2] Y. Cui, L. J. Lauhon, M. S. Gudiksen, J. Wang, and C. M. Lieber, “Diameter-controlled synthesis of single-crystal silicon nanowires”,
Appl. Phys. Lett. 78, 2214 (2001).
[3] S. D. Suk et al., “Investigation of nanowire size dependency on TSNWFET”, IEDM Tech. Dig. 2007, 891-894 (2007).
[4] http://www.itrs.net/reports.html
[5] J. C. Slater and G. F. Koster, “Simplified LCAO Method for the Periodic Potential Problem”, Phys. Rev. 94, 1498-1524 (1954).
April 14, 2008 DRAFT
15
[6] T. B. Boykin, G. Klimeck, and F. Oyafuso, “Valence band effective-mass expressions in the sp  d  s  empirical tight-binding model applied
to a Si and Ge parametrization”, Phys. Rev. B 69, 115201 (2004).
[7] A. Martinez, M. Bescond, J. R. Barker, A. Svizhenkov, A. Anantram, C. Millar, and A. Asenov, “Self-consistent full 3D real-space NEGF
simulator for studying of non-perturbative effects in nano-MOSFET”, IEEE Trans. Electron Dev. 54, 2213 (2007).
[8] J. Wang, E. Polizzi, and M. S. Lundstrom, “A three-dimensional quantum simulation of silicon nanowire transistors with the effective-mass
approximation”, J. Appl. Phys. 96, 2192-2203 (2004)
[9] M. Luisier and A. Schenk, “Atomistic Simulation of Nanowire Transistors”, J. of Computational and Theoretical Nanoscience 5, 1-15
(2008).
[10] M. P. Lopez Sancho, J. M. Lopez Sancho, and J. Rubio, “Highly convergent schemes for the calculation of bulk and surface Green
functions”, J. Phys. F: Met. Phys., 15, 851-858 (1985).
[11] C. Rivas and R. Lake, “Non-equilibrium Green function implementation of boundary conditions for full band simulations of substrate-
nanowire structures”, Phys. Stat. Sol. B, 239, 94-102 (2003).
[12] M. Luisier, G. Klimeck, A. Schenk, and W. Fichtner, “Atomistic Simulation of Nanowires in the ,÷ Tight-Binding Formalism: from
Boundary Conditions to Strain Calculations, Phys. Rev. B, 74, 205323 (2006).
[13] Y. -J. Ko, M. Shin, S. Lee, and K. W. Park, “Effects of atomistic defects on coherent electron transmission in Si nanowires: Full band
calculations”, J. of Appl. Phys. 89, 374-379 (2001).
[14] M. Sta¨dele, B. R. Tuttle, and K. Hess, “Tunneling through ultrathin SiO  gate oxides from microscopic models”, J. of Appl. Phys. 89,
348-363 (2001).
[15] M. Luisier, “Full-Band Quantum Transport in Nanowire Transistors”, to appear in the upcoming edition of the J. of Comp. Electronics
(2008).
[16] http://www.tacc.utexas.edu/resources/hpcsystems/
[17] C. Catlett et al., “TeraGrid: Analysis of Organization, System Architecture, and Middleware Enabling New Types of Applications”, HPC
and Grids in Action, Ed. Lucio Grandinetti, IOS Press ’Advances in Parallel Computing’ series, Amsterdam (2007).
[18] P. M. Gresho and R. L. Sani, “Incompressible Flow and the Finite Element Method: Isothermal Laminar Flow”, John Wiley and Sons,
New York (2000).
[19] R. E. Bank, D. J. Rose, and W. Fichtner, “Numerical Methods for Semiconductor Device Simulation”, IEEE Trans. Electron Dev. 30, 1031
(1983).
[20] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A high-performance, portable implementation of the MPI message passing interface
standard”, Parallel Computing 22, 789 (1996).
[21] J. Dongarra, “Basic Linear Algebra Subprograms Technical Forum Standard”, International Journal of High Performance Applications and
Supercomputing, 16, 1–111 (2002).
[22] O. Schenk and K. Ga¨rtner, “Solving Unsymmetric Sparse Systems of Linear Equations with PARDISO, Journal of Future Generation
Computer Systems”, J. of Future Generation Computer Systems 20, 475 (2004).
[23] X. S. Li and J. W. Demmel “SuperLU DIST: A Scalable Distributed Memory Sparse Direct Solver for Unsymmetric Linear Systems”,
ACM Trans. on Math. Software 29, 110 (2003).
[24] P. R. Amestoy, I. S. Duff, and J.-Y. L’Excellent, “Multifrontal parallel distributed symmetric and unsymmetric solvers” Comput. Methods
in Appl. Mech. Eng. 184, 501 (2000).
[25] T. B. Boykin, M. Luisier, and G. Klimeck, “Multi-band transmission calculations for nanowires using an optimized renormalization
method”, to appear in Phys. Rev. B (2008).
[26] R. Lake, G. Klimeck, R. C. Bowen, and D. Jovanovic, “Single and multiband modeling of quantum electron transport through layered
semiconductor devices”, J. of Appl. Phys. 81, 7845 (1997).
[27] S. Cauley, J. Jain, C.-K. Koh, and V. Balakrishnan, “A scalable distributed method for quantum-scale device simulation”, J. Appl. Phys. 101,
123715 (2007).
[28] M. Luisier, A. Schenk, W. Fichtner, T. B. Boykin, and G. Klimeck, “A parallel sparse linear solver for nearest-neighbor tight-binding
problems”, submitted to EURO-PAR 2008 Conference, (2008).
[29] T. A. Davis, “A column pre-ordering strategy for the unsymmetric-pattern multifrontal method”, ACM Trans. on Math. Software 30, 165
(2004).
April 14, 2008 DRAFT
16
[30] E. Polizzi and A. H. Sameh, “A parallel hybrid banded system solver: the SPIKE algorithm” Parallel Comp. 32, 177 (2006).
[31] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and
D. Sorensen, “LAPACK User’s Guide”, Third Edition, SIAM, Philadelphia (1999).
[32] L. S. Blackford et al., “ScaLAPACK Users’ Guide”, SIAM, Philadelphia (1997).
[33] R. S. Tuminaro, M. Heroux, S. A. Hutchinson, and J. N. Shadid, “Official Aztec User’s Guide: Version 2.1” (1999).
[34] M. Luisier, A. Schenk, and W. Fichtner, “Three-Dimensional Full-Band Simulations of Si Nanowire Transistors”, IEDM Tech. Digest
2006, 811 (2006).
[35] M. Luisier, A. Schenk, and W. Fichtner, “Atomistic treatment of interface roughness in Si nanowire transistors with different channel
orientations”, Appl. Phys. Lett. 90, 102103 (2007).
[36] M. Luisier, A. Schenk, and W. Fichtner, “Full-band atomistic study of source-to-drain tunneling in Si nanowire transistors”, Int. Conf. on
Simulation of Semiconductor Processes and Devices (SISPAD) Vienna, Austria (2007).
[37] M. Luisier, “Full-Band Quantum Transport in Nanowire Transistors” Int. Wokshop on Computational Electronics (IWCE), Amherst, USA
(2007).
[38] http://www.top500.org
[39] D. Z.-Y. Ting, M. Gu, X. Chi, and J. Cao, “Numerical Acceleration of Three-Dimensional Quantum Transport Method Using a Seven-
Diagonal Pre-Conditioner”, J. of Comp. Elec. 1, 93 (2004).
April 14, 2008 DRAFT
