Time-efficient simulations of tight-binding electronic structures with Intel Xeon Phi™ many-core processors, by Ryu, Hoon et al.
Computer Physics Communications 209 (2016) 79–87
Hoon Ryu a,∗, Yosang Jeong a, Ji-Hoon Kang a, Kyu Nam Cho b,1
a National Institute of Supercomputing and Networking, Korea Institute of Science and Technology Information, Daejeon 34141, Republic of Korea
b School of Electrical Engineering, Korea University, Seoul 02841, Republic of Korea
Article info
Article history:
Received 12 May 2016
Received in revised form 9 August 2016
Accepted 19 August 2016
Available online 26 August 2016
Keywords:
Tight-binding simulation
Electronic structure calculation
Many-core computing
Xeon Phi™ coprocessor
Abstract
Modelling of multi-million-atom semiconductor structures is important because it not only predicts properties
of physically realizable novel materials, but can also accelerate advanced device designs. This work presents
a new Technology-Computer-Aided-Design (TCAD) tool for nanoelectronics modelling, which uses an
sp³d⁵s* tight-binding approach to describe multi-million-atom structures and simulates their electronic
structures with high performance computing (HPC), including atomic effects such as alloy and dopant
disorders. Named the Quantum simulation tool for Advanced Nanoscale Devices (Q-AND), the tool
scales well on traditional multi-core HPC clusters, which implies a strong capability for large-scale
electronic structure simulations, and shows a notable performance enhancement on recent
clusters of Intel Xeon Phi™ coprocessors. A review of a recent modelling study, conducted to understand
experimental work on highly phosphorus-doped silicon nanowires, is presented to demonstrate the
utility of Q-AND. Developed through an Intel Parallel Computing Center project, Q-AND will be
opened to the public to establish a sound framework for nanoelectronics modelling with advanced
many-core HPC clusters. With details of the development methodology and an exemplary study of dopant
electronics, this work presents a practical guideline for TCAD development to researchers in the field
of computational nanoelectronics.
© 2016 The Author(s). Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
The rapid progress in fabrication technologies has accelerated
the emergence of new classes of nanoscale devices and structures
that are expected to deliver revolutionary changes in electronics
and information technology. Scanning tunnelling microscope
lithography [1], for example, has led to the realization of various
innovative silicon (Si) devices such as ultrathin interconnects that
are just four atoms wide [2,3], and transistors where the transport
is controlled by a single atomic channel [4,5], representing the
ultimate limit of device downscaling. Semiconductor structures in
a nanoscale regime cannot be studied with bulk-physics anymore,
and electronic device engineering must meet material science
to develop bottom-up design concepts considering material
granularity and quantum effects of confined structures. First-principles
theories such as Density Functional Theory [6] have
led the modelling research of semiconductor nanostructures with
state-of-the-art simulation tools such as VASP [7]. The huge
computing cost, however, limits the scope of simulations to surfaces
and interfaces, or to extremely small structures of several
hundred atoms, in which it is hard to consider atomic effects such
as alloy and dopant disorders that normally occur in the
lithographical process [8–12].
∗ Corresponding author.
E-mail address: elec1020@kisti.re.kr (H. Ryu).
1 Present Address: Software Center, Samsung Electronics, Seoul 06765, Republic of Korea.
http://dx.doi.org/10.1016/j.cpc.2016.08.015
Experimentally realizable finite semiconductor structures are
typically a few tens of nanometres (nm) in size and consist of
several million atoms. To simulate these large-scale electronic
structures, empirical methods such as the parabolic effective mass
approximation (EMA) [13], k · p (KP) theory [14], and the
tight-binding (TB) approach [15] have been widely employed. EMA
holds in the vicinity of conduction band minima, but it does
not always ensure correct estimation of energy quantization in
nanostructures [16], and it gives poor estimation of valence bands.
8-band KP theory gives better estimation of valence bands, but its
utilization is still limited by the difficulty of modelling indirect
bandgap materials [17]. The nearest neighbour sp³d⁵s* TB model,
however, satisfies the accuracy condition because its parameters are fit
to reproduce experimentally known bulk band structures [15,18].
Its atomic description of simulation domains is advantageous since
phenomena such as interface/surface roughness, alloy and dopant
disorder can be treated easily.
The NanoElectronic MOdelling tool (NEMO) [3,19–21], a well-known
package for TB simulations, has attracted the computational
nanoelectronics research community, accelerating high-quality
modelling research that is strongly connected to experiments
[2–5,22–25]. NEMO has clearly established the
framework of large-scale electronic structure calculations with
high performance computing (HPC), and has raised the status
of the empirical TB approach with respect to first-principles theories.
NEMO, however, does not yet fully utilize the benefit of the latest HPC
resources that support both multi-core (traditional CPUs that normally
have a few cores) and many-core (Intel Xeon Phi™ or Nvidia
GPU coprocessor cards that normally have dozens of cores) computing.
NEMO-3D [19] and NEMO-3D PETA [3,20] have been developed
to run in parallel on multi-core HPC clusters. NEMO 5 [21],
the latest version of NEMO, uses the SLEPc package to
compute electronic structures [26], and its performance enhancement
with many-core computing has not been reported yet.
Having served as an Intel Parallel Computing Center (IPCC) since
2014, the first in the Asia-Pacific area [27], we have developed
a technology-computer-aided-design (TCAD) tool that would
be useful for designing advanced semiconducting materials and
devices of physically realizable sizes. Named the Quantum
simulation tool for Advanced Nanoscale Devices (Q-AND), the tool
employs a nearest neighbour sp³d⁵s* empirical TB approach to describe
semiconductor nanostructures, and solves electronic structures
of multi-million-atom systems, showing a ∼30% speed-up
with Intel Xeon Phi™ coprocessors [28] compared to the case
where multi-core HPC clusters are used. With the purpose
of introducing Q-AND to the nanoelectronics community for the
first time, this work delivers detailed descriptions of the following
items: (1) the development methodology, focusing on the parallelization
scheme, the in-house solvers used for Schrödinger–Poisson
simulations, and the strategy to improve the computing performance
of electronic structure simulations with Xeon Phi™ coprocessors,
(2) the strong and weak scalability on the Tachyon-II HPC
cluster [29], measured for electronic structure calculations of a single
phosphorus atom embedded in Si layers (Si:P quantum dots)
that contain several million atoms, (3) the performance improvement
measured on our testbed of Xeon Phi™ clusters, and (4) the
review of a modelling study that has been conducted for highly
phosphorus-doped Si nanowires (Si:P nanowires) [30]. The Q-AND
package, which will be opened to the public soon through our IPCC web
site [27], will establish a sound framework for nanoelectronics modelling
with the latest many-core HPC clusters. Details of the
development methodology, together with the exemplary study of dopant
electronics that this work presents, will also deliver a practical
guideline for TCAD development, accelerating the use of many-core
computing for large-scale simulations in nanoelectronics.
2. Methods
2.1. A scheme for parallelization of simulation domains
Q-AND performs Schrödinger–Poisson simulations to evaluate
electronic structures in a self-consistent manner. The basis of
parallel computing of large-scale electronic structures therefore depends
on how the Hamiltonian and Poisson system matrices are decomposed
over multiple cores before being fed to the iterative solvers that will
be discussed in the next subsection. The nearest neighbour sp³d⁵s*
TB approach uses a set of 10 (20 with spin–orbit coupling)
orbital bases to describe a single atom. The on-site term of each
atom and its coupling terms with nearest neighbour atoms are
represented with 10 × 10 (20 × 20 with spin–orbit coupling) block
matrices whose elements are all complex values. Consequently, the
degrees of freedom (DOFs) of Hamiltonian matrices that represent
nanostructures become directly proportional to the number of
atoms composing the nanostructures, as illustrated in Fig. 1(a),
which shows how a single zincblende unitcell is mapped to the
TB Hamiltonian. As Q-AND builds Poisson matrices with a Finite
Difference Method (FDM), the logic of mapping domains to system
matrices is exactly the same as shown in Fig. 1(a), except that the on-site
and coupling terms are each represented with a single real value and each
on-site spot has six nearest neighbours.
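The proportionality between atom count and Hamiltonian size can be made concrete with a short calculation. The following is a minimal sketch, not code from Q-AND: the function name is hypothetical, and the value of 8 atoms per [100] unitcell is taken from the n × 32 × 32 unitcells = n × 8192 atoms relation quoted in Section 3.1.

```python
ATOMS_PER_UNITCELL = 8   # from "n x 32 x 32 unitcells (n x 8192 atoms)" in Section 3.1
ORBITALS_PER_ATOM = 10   # sp3d5s* basis without spin-orbit coupling


def hamiltonian_dofs(nx, ny, nz, spin_orbit=False):
    """Return (atoms, DOFs) for a cuboid of nx x ny x nz [100] unitcells.

    Hypothetical helper for illustration only.
    """
    atoms = nx * ny * nz * ATOMS_PER_UNITCELL
    orbitals = ORBITALS_PER_ATOM * (2 if spin_orbit else 1)
    return atoms, atoms * orbitals


# Strong-scaling benchmark of Section 3.1: 1024 x 32 x 32 unitcells.
atoms, dofs = hamiltonian_dofs(1024, 32, 32)
print(atoms, dofs)  # 8388608 83886080
```

The result agrees with the ∼8.4 million atoms and ∼8.4 × 10⁷ DOFs quoted for the benchmark problem; enabling spin–orbit coupling doubles the DOFs.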
Simulation domains are decomposed in a multi-dimensional
manner to maximize the utility of parallel computing for
large-scale simulations. As Fig. 1(b) shows, the 3-dimensional
domain is decomposed along the x-direction with the Message Passing
Interface (MPI), so slabs allocated to different MPI ranks cannot
access one another without communications. The slabs
allocated to a single rank, however, are further decomposed using OpenMP
threads, effectively achieving a multi-dimensional decomposition
of simulation domains. The corresponding Hamiltonian and
Poisson matrices are then decomposed in a row-wise manner,
where blocks of the matrix allocated to different ranks can
only be accessed via inter-rank communications, and calculations
involving a single block of the matrix (in the same rank) can be
further parallelized with communication-free thread computing.
As Fig. 1 shows, both the Hamiltonian
and Poisson matrices are highly sparse because non-zero couplings
are assumed only among nearest neighbours, and all the system
matrices are therefore stored in a Compressed Sparse Row
format to save as much memory as possible [31]. As the
decomposition along the y-/z-directions is only supported with
shared-memory parallelization, Q-AND may not be able to
simulate devices whose dimensions along the y-/z-directions are
so large that a single computing node cannot store the
corresponding Hamiltonian matrix. However, we note that the
size-limit is large enough to model devices in the quantum
regime. For example, if a single node has 16 GB of memory and an
sp³d⁵s* TB model is used with no spin–orbit coupling, it can store a
Hamiltonian matrix of up to a ∼280 nm × 280 nm Si (100) y–z plane
whose dimension along the x-direction is a single [100] unitcell. The
detailed logic for the estimation of this size-limit is
available in the Supplementary material (see Appendix A).
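To illustrate why Compressed Sparse Row (CSR) storage suits nearest-neighbour system matrices, the sketch below builds the FDM Poisson matrix of a small grid, where each site couples to at most six neighbours. It is an illustrative assumption throughout, not Q-AND code: the function names, the unit grid spacing, and the truncated (Dirichlet-like) boundary are all chosen for the example.

```python
def poisson_csr(nx, ny, nz):
    """Build a 7-point finite-difference Laplacian in CSR form (values, col_idx, row_ptr).

    Assumes unit grid spacing and simply drops couplings that would leave the grid.
    """
    values, col_idx, row_ptr = [], [], [0]

    def index(i, j, k):
        return (i * ny + j) * nz + k

    neighbours = ((-1, 0, 0), (1, 0, 0), (0, -1, 0), (0, 1, 0), (0, 0, -1), (0, 0, 1))
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                entries = [(index(i, j, k), 6.0)]          # on-site term
                for di, dj, dk in neighbours:              # six nearest neighbours
                    a, b, c = i + di, j + dj, k + dk
                    if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
                        entries.append((index(a, b, c), -1.0))
                for col, val in sorted(entries):
                    col_idx.append(col)
                    values.append(val)
                row_ptr.append(len(values))                # end of this row
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a vector, touching only stored non-zeros."""
    return [
        sum(values[p] * x[col_idx[p]] for p in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]


values, col_idx, row_ptr = poisson_csr(4, 4, 4)
print(len(values), "stored values vs", 64 * 64, "dense entries")  # 352 stored values vs 4096 dense entries
```

The number of stored values grows linearly with the number of grid points (at most seven per row here), which is the property that makes storing multi-million-row matrices feasible.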
2.2. Iterative solvers developed for Schrödinger–Poisson simulations
To determine electrostatics in confined structures self-consistently,
Q-AND solves a TB Schrödinger equation together
with an FDM Poisson equation. The supplementary information of
Ref. [3] elaborates the Schrödinger–Poisson loop implemented in
Q-AND, where an exchange–correlation energy term based on the Local
Density Approximation [32] can be included in simulations
at the user's preference to correct the electrostatic potential
for interactions among strongly confined carriers. Following the
scheme for an MPI/OpenMP hybrid decomposition of simulation domains
described in Fig. 1(b), we have implemented the Schrödinger
and Poisson solvers with the well-known Lanczos [33] and Conjugate
Gradient (CG) [34] algorithms, respectively, which are purely iterative
and are therefore advantageous for achieving high scalability
with parallel computing.
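As a minimal serial sketch of the kind of iterative solver referred to here, the following textbook CG loop solves a 1-dimensional FDM Poisson problem. It is not the Q-AND solver: the function names, the 1-dimensional test matrix and the tolerance are assumptions made for illustration. The loop body uses only dot-products, scaled vector additions and matrix–vector products, which is what makes this family of solvers straightforward to distribute.

```python
def dot(u, v):
    """Vector dot-product."""
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha, x, y):
    """Return alpha * x + y (scalar-vector multiplication plus vector addition)."""
    return [alpha * a + b for a, b in zip(x, y)]


def laplacian_1d_matvec(x):
    """Matrix-vector product with the 1D finite-difference Laplacian (2 on-site, -1 coupling)."""
    n = len(x)
    return [
        2.0 * x[i] - (x[i - 1] if i > 0 else 0.0) - (x[i + 1] if i < n - 1 else 0.0)
        for i in range(n)
    ]


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite A given only as a matvec callable."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x for the zero initial guess
    p = list(r)                 # search direction
    rr = dot(r, r)
    iterations = 0
    while iterations < max_iter and rr ** 0.5 > tol:
        ap = matvec(p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)
        rr = rr_new
        iterations += 1
    return x, iterations


x_true = [float(i) for i in range(1, 9)]
solution, iterations = conjugate_gradient(laplacian_1d_matvec, laplacian_1d_matvec(x_true))
print(iterations, [round(v, 6) for v in solution])
```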
Fig. 1. Construction of the TB Hamiltonian and parallelization scheme. (a) A single [100] zincblende unitcell and its mapping to a TB Hamiltonian matrix. Assuming nearest
neighbour coupling, the sp³d⁵s* TB model describes the on-site term of each atom and its coupling to nearest neighbour atoms with 10 × 10 (20 × 20 with spin–orbit
coupling) block matrices. DOFs of Hamiltonian matrices representing 3-dimensional nanostructures, therefore, become directly proportional to the number of atoms in
the nanostructures. (b) A multi-dimensional domain decomposition scheme. The simulation domain is decomposed along the x-direction with MPI ranks, where the slabs
allocated to a single MPI rank are further decomposed using OpenMP threads. The Hamiltonian matrix is then decomposed in a row-wise manner, where blocks of the matrix
allocated to different MPI ranks can only be accessed through inter-rank communications, and calculations involving a single block of the matrix (in the same MPI rank) can be
further parallelized with communication-free thread computing.
The computational flow of the Lanczos and CG algorithms indicates
that the iteration loop of both algorithms is composed of
four types of basic numerical operations: vector
dot-product, vector addition, scalar–vector multiplication, and
matrix–vector multiplication [33,34]. Of these operations,
scalar–vector multiplication and vector addition are straightforward,
given vectors and matrices that are decomposed according
to the domain-decomposition scheme shown in Fig. 1(b).
Vector dot-product and matrix–vector multiplication, however,
are more complicated since they require inter-rank
MPI communications that may degrade the scalability on HPC clusters.
Fig. 2 conceptually illustrates how vector dot-product and
matrix–vector multiplication are computed in parallel in our code. The
vector dot-product is computed by adding all local dot-products
through MPI collective communications (communications among
all the ranks), where the local dot-product in each MPI rank is
evaluated with thread computing using OpenMP. For the
matrix–vector multiplication, MPI communications are still necessary
since a vector allocated to each rank must be transferred to
other ranks to complete the multiplication. The communication,
however, is needed only between adjacent ranks since both
the Hamiltonian and Poisson matrices assume nearest neighbour
coupling, as shown in Fig. 2.
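The communication pattern described above can be mimicked in a single process. In the sketch below, each list entry stands in for the data owned by one MPI rank; plain list indexing replaces the point-to-point exchange with adjacent ranks, and a Python sum replaces the collective reduction. The 1-dimensional Laplacian and all function names are assumptions for illustration; no MPI library is used and this is not the Q-AND code.

```python
def local_matvec(x_local, x_prev, x_next):
    """Rows owned by one rank of a 1D nearest-neighbour matrix (2 on-site, -1 coupling)."""
    m = len(x_local)
    # (i) local matrix block times local vector
    y_local = [
        2.0 * x_local[i]
        - (x_local[i - 1] if i > 0 else 0.0)
        - (x_local[i + 1] if i < m - 1 else 0.0)
        for i in range(m)
    ]
    # (ii) coupling block times the vector received from the previous rank
    y_prev = [0.0] * m
    if x_prev is not None:
        y_prev[0] = -x_prev[-1]
    # (iii) coupling block times the vector received from the next rank
    y_next = [0.0] * m
    if x_next is not None:
        y_next[-1] = -x_next[0]
    # (iv) element-wise sum of the three partial results
    return [a + b + c for a, b, c in zip(y_local, y_prev, y_next)]


def distributed_matvec(slabs):
    """Apply local_matvec on every simulated rank; only adjacent slabs are ever read."""
    result = []
    for rank, x_local in enumerate(slabs):
        x_prev = slabs[rank - 1] if rank > 0 else None               # stands in for a receive
        x_next = slabs[rank + 1] if rank < len(slabs) - 1 else None  # stands in for a receive
        result.append(local_matvec(x_local, x_prev, x_next))
    return result


def distributed_dot(slabs_u, slabs_v):
    """Local dot-products per rank, then a global sum standing in for a collective reduction."""
    local = [sum(a * b for a, b in zip(u, v)) for u, v in zip(slabs_u, slabs_v)]
    return sum(local)


x = [float(i) for i in range(1, 13)]
slabs = [x[0:4], x[4:8], x[8:12]]        # three simulated ranks
print(distributed_matvec(slabs)[-1][-1])  # 13.0
print(distributed_dot(slabs, slabs))      # 650.0
```

Because each rank reads only its two neighbours during the multiplication, the volume of exchanged data per rank does not grow with the total number of ranks, whereas the dot-product needs a reduction over all ranks.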
2.3. Performance enhancement with many-core computing
As discussed in the previous subsection, the performance of the
Lanczos and CG solvers depends on that of matrix–vector
multiplication and vector dot-product, which involve MPI communications.
Since the performance of large-scale electronic structure
simulations depends particularly strongly on
that of matrix–vector multiplication, as will be discussed in the
next section, reducing the time required to perform matrix–vector
multiplication is beneficial for making overall simulations run
faster. As one of the most innovative points of this work compared
to the NEMO packages [3,19–21], Q-AND enhances the performance
of matrix–vector multiplication with Intel Knights Corner Xeon
Phi™ coprocessors [28,35] and many-core computing in an
asynchronous offload mode [36], with which the
multiplication can be shared by both multi-core and many-core processors.
Fig. 3 illustrates how Q-AND computes matrix–vector multiplication
in an asynchronous offload mode. Given that MPI rank i
(multi-core processor) has an M × N matrix Ai, a vector xj that
is transferred from rank j by MPI communications (or originally
allocated to rank i, if i = j), and the output vector yi, matrix–vector
multiplication can be done with a simultaneous utilization of
multi-core and many-core processors. In more detail,
the computation is done as follows: (1) The matrix Ai is copied to
a coprocessor card just once before the Lanczos iteration starts.
(2) The vectors xi and xj's are copied to the same coprocessor
card. (3) Many-core processors perform the multiplication involving
the first K rows of Ai, while the multiplication involving the other (M–K)
rows is performed by multi-core processors at the same time.
(4) The output vector of the multiplication, yi, is then copied to the
multi-core processors and used to update the vector xi. (5) The Lanczos
iteration is performed by repeating steps (2)–(4). The ratio of the
computing load taken by coprocessors to the total load, which can be
represented by the quantity K/M without
loss of generality, can then serve as a control factor for the overall
performance of electronic structure calculations. We note that Q-AND
demonstrates an enhancement in computing performance of up to
∼30% in selected benchmarking tests, which will be elaborated
in the next section.
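The row split in step (3) can be sketched with two concurrent workers. In this illustration a background thread plays the role of the coprocessor and the calling thread plays the role of the host; the function names and the dense list-of-lists matrix are assumptions, no actual offload runtime is involved, and Python threads do not provide the true hardware concurrency of the scheme described above.

```python
from concurrent.futures import ThreadPoolExecutor


def rows_times_vector(matrix, x, start, stop):
    """Multiply rows [start, stop) of a dense matrix by the vector x."""
    return [sum(a * b for a, b in zip(matrix[r], x)) for r in range(start, stop)]


def split_matvec(matrix, x, load_fraction):
    """Share one matrix-vector product between a simulated coprocessor and the host.

    load_fraction plays the role of K/M: the share of rows given to the coprocessor.
    """
    m = len(matrix)
    k = round(load_fraction * m)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Launch the first K rows asynchronously (stand-in for the offloaded part).
        offloaded = pool.submit(rows_times_vector, matrix, x, 0, k)
        # Meanwhile the host handles the remaining M - K rows.
        host_part = rows_times_vector(matrix, x, k, m)
        # Synchronize and assemble the output vector.
        return offloaded.result() + host_part


A = [[float((r + 1) * (c + 1)) for c in range(4)] for r in range(10)]
x = [1.0, 2.0, 3.0, 4.0]
print(split_matvec(A, x, 0.65) == rows_times_vector(A, x, 0, 10))  # True
```

Joining the two row ranges is equivalent to the element-wise sum of two zero-padded partial vectors; the result is independent of the split, so K/M only affects how long each side takes.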
Fig. 2. Parallel computing of vector dot-product and matrix–vector multiplication. (a) The process of computing the vector dot-product. The local dot-product, the dot-product
of vectors allocated to each rank, is first evaluated with thread computing ((i) → (ii)). The total dot-product is then computed by adding all the local dot-products through
MPI communications among all the ranks ((ii) → (iii)). (b) The process of computing matrix–vector multiplication. While the process here is described with a focus on Rank
2, it is exactly the same in other ranks. The matrix–vector multiplication is performed with the following steps: (i) Multiply a local matrix (the matrix allocated to each
rank) by a local vector (the vector allocated to each rank). (ii) Multiply a local matrix by a vector in the previous rank, which should be transferred by point-to-point MPI
communications. (iii) Multiply a local matrix by a vector in the next rank, which should be transferred by point-to-point MPI communications. (iv) Take an element-wise
sum of vectors obtained from the steps (i)–(iii).
3. Results and discussion
3.1. Performance on HPC clusters of a multi-core base
The performance of Q-AND is investigated on traditional
multi-core HPC clusters, particularly with electronic
structure calculations of Si:P quantum dots, which have recently
attracted attention due to their promising utility as single electron
transistors for Si-based quantum computing applications [4,25].
Benchmarking tests are done by measuring the scalability for large-scale
simulations of a single phosphorus (P) atom placed exactly in
the centre of cuboid [100] Si layers of various sizes, where atomic
structures of devices are represented with an sp³d⁵s* TB model
(no spin–orbit coupling). The scalability is measured in two ways:
(1) the strong scalability, which shows how the computing time
changes with different numbers of cores when the
problem size is fixed, and (2) the weak scalability, which focuses on the
computing time when the problem size is proportional to the
number of cores. All the tests are performed on the Tachyon-II HPC
cluster, which is ranked 373rd among the world's top 500 HPC clusters [29].
The Tachyon-II cluster has a total of 3200 computing nodes
connected with a 10G network, where each node has 8 Intel Xeon
X5570 cores (2.93 GHz) and 24 GB of memory.
As a benchmarking problem, we consider the electronic
structure of a Si:P quantum dot that includes a P atom encapsulated
by a cuboid Si layer of 1024 × 32 × 32 [100] unitcells (∼8.4
million atoms). Involving a Hamiltonian matrix of ∼8.4 × 10⁷
DOFs, the calculation is performed with a convergence criterion
of 10⁻⁸ eV, and is completed when 10⁴ Lanczos iterations are
reached, or the 10 lowest energy-levels in the conduction band are found.
Fig. 4(a) describes the strong scalability measured with 64–1024
ranks and 1–8 threads per rank (each rank or thread occupies
a single physical core), demonstrating a clear pattern that the
wall-time (the total time needed to complete the simulation) is
reduced as more cores are utilized. When a single thread is used,
the simulation shows an 8.6x reduction in the wall-time with a
16x increase of MPI ranks. The wall-time is also reduced as more
threads are used. In terms of the wall-time that is averaged over
the data measured with the same number of threads, the computation
shows a 2.6x speed-up with 8 threads compared to the case when
a single thread is utilized. Fig. 4(a) also indicates that the speed-up
becomes less remarkable as more cores are utilized. The speed-up
is 3.5x with a 4x increase of MPI ranks from 64 to 256, but
it reduces to 2.5x with the next 4x increase of MPI ranks from 256
to 1024. A similar pattern is also observed in thread computing,
where the speed-up is 1.7x with a 2x increase of threads from 1
to 2, and 1.1x with a 2x increase from 4 to 8.
Fig. 3. Asynchronous offload of matrix–vector multiplication. A strategy for
the performance enhancement in matrix–vector multiplication with many-core
computing (described with the example shown in Fig. 2(b)). The performance
of matrix–vector multiplication can be improved with a simultaneous utilization
of many-core and multi-core processors. While many-core processors perform
multiplication involving the first K rows of a matrix, multi-core processors
perform multiplication involving the other (M–K) rows at the same time. The two
independent computing processes, which happen simultaneously, are synchronized
at the end of multiplication, and the final result is obtained by taking an
element-wise sum of vectors obtained from the two processes. The ratio of the computing
load taken by coprocessors to the total load, represented by the quantity K/M
without loss of generality, can serve as a control factor for the overall performance
of electronic structure calculations.
Fig. 4. Strong scalability of electronic structure calculations. (a) The strong
scalability obtained on the Tachyon-II HPC cluster with simulations of the Si:P quantum
dot that has a single P atom encapsulated by a cuboid Si layer of 1024 × 32 × 32
[100] unitcells (∼8.4 million atoms). The wall-time is measured with 64–1024 MPI
ranks and 1–8 threads per rank, where each rank or thread occupies a single physical
core. When a single thread is used, the wall-time shows an 8.6x reduction with
a 16x increase of MPI ranks. The wall-time is also reduced as more threads are
used, such that the computation shows a 2.6x speed-up with 8 threads, compared
to the case with a single thread. (b) The wall-time plotted component by component.
The good scalability is mainly due to the scalability of matrix–vector multiplication.
The burden of MPI communications does not get worse even as the number of
MPI ranks increases, as most MPI communications happen between adjacent
ranks during the process of matrix–vector multiplication (Fig. 2(b)). The scalability
becomes worse as more ranks and threads are used, because the scaled computing
time of matrix–vector multiplication then becomes comparable to, or is overwhelmed
by, the time taken by other processes that are not as scalable as matrix–vector
multiplication.
Fig. 5. Weak scalability of electronic structure calculations. (a) The weak scalability
obtained on the Tachyon-II HPC cluster with simulations of Si:P quantum dots of
cuboid Si layers of n × 32 × 32 [100] unitcells (n × 8192 atoms), where n is
the number of MPI ranks used for simulations. The wall-time is measured with
the same numbers of MPI ranks and threads as those considered for the strong
scalability. (b) The wall-time plotted component by component. Both the overall and
component-wise weak scalability are fairly good in general since the wall-time does
not show a remarkable fluctuation as the number of ranks changes. Results of the
strong and weak scalability indicate that simulations of much larger sizes could
be handled within a reasonable time as long as computing resources are available,
demonstrating the strong capability of Q-AND for computing large-scale problems
with HPC clusters of several hundreds of thousands of cores.
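The strong-scaling factors quoted in the text for Fig. 4(a) can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text; the helper name is hypothetical, and the implied speed-up for the 2 to 4 thread step is an inference rather than a reported value, since the 2.6x figure is an average over several rank counts.

```python
def parallel_efficiency(speedup, resource_factor):
    """Fraction of the ideal speed-up that is actually obtained."""
    return speedup / resource_factor


# MPI ranks: 3.5x for 64 -> 256 and 2.5x for 256 -> 1024, as quoted in the text.
combined_rank_speedup = 3.5 * 2.5
print(combined_rank_speedup)                      # 8.75, close to the quoted overall 8.6x
print(round(parallel_efficiency(8.6, 16), 2))     # 0.54 for a 16x increase of ranks

# Threads: 1.7x for 1 -> 2 and 1.1x for 4 -> 8, with 2.6x overall for 1 -> 8.
implied_2_to_4 = 2.6 / (1.7 * 1.1)
print(round(implied_2_to_4, 2))                   # 1.39 (inferred, not reported)
print(round(parallel_efficiency(2.6, 8), 3))      # 0.325 for an 8x increase of threads
```

The product of the two rank steps (8.75x) is consistent with the quoted 8.6x to within the rounding of the individual factors.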
The pattern discussed in the previous paragraph can be
understood with Fig. 4(b), which breaks the wall-time into
four components, i.e., the time taken by MPI communications
(Comm), matrix–vector multiplication (MVMul), vector
dot-product (VVDot), and other operations (Others). From Fig. 4(b),
it is clear that matrix–vector multiplication takes a major
portion of the entire computing load. The good strong scalability of
the overall computation shown in Fig. 4(a) is therefore due to the
scalability of matrix–vector multiplication, which shows an almost
16x speed-up when the number of MPI ranks increases from 64 to
1024 (16x) at the same number of threads, and a 4.2x speed-up when
the number of threads increases from 1 to 8 (8x). One remarkable
point of our results is that the burden of MPI communications does
not get worse even if the number of MPI ranks increases, since
most MPI communications happen between adjacent ranks
during the process of matrix–vector multiplication, as illustrated
in Fig. 2(b). The overall strong scalability becomes worse as more
MPI ranks and threads are used, since the scaled computing
time of matrix–vector multiplication then becomes comparable to, or
is overwhelmed by, the time taken by other processes that are not
as scalable as matrix–vector multiplication.
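The saturation described here follows from a simple two-component timing model, sketched below in the spirit of Amdahl's law. The model and its numbers are hypothetical: the two time constants are not measurements from Fig. 4(b), and were merely chosen so that the first 4x step reproduces a 3.5x speed-up.

```python
def wall_time(cores, t_matvec, t_other):
    """Perfectly scalable matrix-vector part plus a part that does not scale at all."""
    return t_matvec / cores + t_other


def speedup(cores_from, cores_to, t_matvec, t_other):
    """Speed-up obtained when going from cores_from to cores_to cores."""
    return wall_time(cores_from, t_matvec, t_other) / wall_time(cores_to, t_matvec, t_other)


T_MATVEC = 6400.0  # hypothetical single-core time of the scalable part (arbitrary units)
T_OTHER = 5.0      # hypothetical time of the non-scalable part (arbitrary units)

print(speedup(64, 256, T_MATVEC, T_OTHER))              # 3.5
print(round(speedup(256, 1024, T_MATVEC, T_OTHER), 2))  # 2.67
```

Even with an ideally scaling multiplication, the second 4x step yields less than the first because the fixed part grows from about 5% to about 17% of the wall-time, which is the qualitative trend reported for Fig. 4.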
To measure the weak scalability, we again simulate Si:P
quantum dots, but with cuboid Si layers of n × 32 × 32 [100]
unitcells (n × 8192 atoms), where n is the number of MPI ranks used
for simulations. The overall and component-wise scalability are
shown in Fig. 5(a) and (b), respectively, and can be regarded
as fairly good in general since the wall-time does not depend
strongly on the number of MPI ranks. The
key message delivered by the weak-scalability results is
that simulations of even much larger sizes can be handled in a
reasonable computing time, as long as more computing nodes can
be procured. In the current situation, where it is easy to find HPC
clusters that have several hundreds of thousands of cores [37],
Q-AND could be suitable for exascale simulations, particularly for
devices that have extremely long channels.
3.2. Performance with Intel Xeon Phi™ many-core processors
Fig. 6. Performance with Intel Xeon Phi™ many-core processors. (a) The strong scalability obtained with simulations of Si:P quantum dots of cuboid Si layers of 60 × 80 × 80
[100] unitcells (3 million atoms). A new control factor, Load_MC, is introduced to see how the performance of electronic structure calculations changes with respect to the
ratio of the computing load of matrix–vector multiplication in coprocessors to the total load. With a fairly good strong scalability, the result shows that the wall-time becomes
minimal when Load_MC is 65% regardless of the number of cores, where the red line plotted indicates the strong scalability when Load_MC is 65%. We note that the scalability
of Q-AND with a zero Load_MC (the case with multi-core processors only) is not remarkably different from that of NEMO-3D PETA (Refs. [3,20]). (b) The wall-time plotted
component by component. All three subplots show that a ∼30% speed-up of the total computing time is achieved with a Load_MC of 65%, compared to the case when coprocessors
are not used (Load_MC = 0), which is due to matrix–vector multiplication that shows an average speed-up of ∼52% with a Load_MC of 65%. When Load_MC = 65%, Q-AND
clearly shows a better performance than NEMO-3D PETA regardless of the number of cores. (c) The weak scalability obtained with simulations of Si:P quantum dots of cuboid
Si layers of 15n × 80 × 80 [100] unitcells (n × 7.5 × 10⁵ atoms), where n is the number of ranks. The wall-time is observed to be quite insensitive to the number of ranks
and Load_MC, indicating the strong capability for large-scale electronic structure calculations.
The performance of electronic structure calculations is improved
with many-core computing in an asynchronous offload
mode, by which many-core and multi-core processors can be used
simultaneously, as described in Fig. 3. The performance is benchmarked
on our small testbed of three computing nodes connected
with a 10G network, where each node has 20 Intel Xeon E5-2670 v2
(2.5 GHz) cores, 256 GB of memory and 2 Intel Xeon Phi™ 7120A
coprocessor cards, and each coprocessor card has 61 physical cores
(1.24 GHz) with 16 GB of memory. The strong scalability is studied
with a Si:P quantum dot that has a cuboid Si layer of 60 × 80 × 80
[100] unitcells (∼3 million atoms). The weak scalability is studied
by simulating Si:P quantum dots with cuboid Si layers of 15n × 80 × 80
[100] unitcells (n × 7.5 × 10⁵ atoms), where n is the number
of MPI ranks. All other simulation conditions, i.e., convergence
criteria, maximal number of iterations and energy-levels, are the same
as those used in the previous subsection.
The strong scalability, measured with 2/4/6 MPI ranks and 10
threads per rank, is shown in Fig. 6(a), where the third control
factor Load_MC represents the ratio of the computing load ofmatrix–vector multiplication in coprocessors to the total load.
The performance of NEMO-3D PETA [3,20], has been also plotted
together, wherewe used 2/4/6MPI ranks for decomposition along
an x-direction with 10 MPI ranks for decomposition along other
directions (5 for y-, 2 for z-direction) to compare the performance
as fairly as possible. The strong scalability of Q-AND is excellent
in general regardless of the Load_MC quantity. We also note that
the wall-times with a zero Load_MC (the case where only multi-
core processors are used) do not show a remarkable difference
from those of NEMO-3D PETA. Fig. 6(a) however delivers an
importantmessage that there is an optimal point for the computing
performance in terms of Load_MC. In particular, the wall-time
becomes minimal when Load_MC is 65% regardless of the number
of cores, which indicates that the performance of electronic
structure calculations is maximized when 65% of matrix–vector
multiplication is done in coprocessors. This phenomenon can be
observed more clearly from Fig. 6(b), where the component-wise
computing times (Comm/MVMul/VVDot/Others) with 20/40/60
cores are plotted as a function of Load_MC. Here we note that
the time taken for the vector-transfer (during Lanczos iteration)
and for the matrix-transfer (before Lanczos iteration starts) are
included in MVMul and Others, respectively. All the three subplots
in Fig. 6(b) confirm that a ∼30% speed-up of the total computing
time is achieved with 65% of Load_MC, compared to the case when
Load_MC is 0% (the case when coprocessors are not used), which
is clearly due to the speed-up of matrix–vector multiplication that
turns out to be ∼52% on average with a Load_MC of 65%. We also
note that the performance of Q-AND with a Load_MC of 65% is clearly
better than that of NEMO-3D PETA regardless of the number of
cores. As shown in Fig. 6(c), the weak scalability of electronic
structure calculations is also good even on a cluster of a many-core
base, because the wall-time does not vary much with the number of
ranks and Load_MC.

Fig. 7. Dopant-incorporation in Si:P nanowires. (a) [110] transport-oriented circular Si:P nanowires. Nanowires are described with supercells assuming a periodic boundary
condition along the transport direction, and are assumed to have an average doping density of ∼2 × 10¹⁹ cm⁻³. (b) Dopant-distributions of highly doped nanowires. For each of
3 different channel-sizes (diameter of 16/20/24 nm), 3 phases of dopant-distributions are considered: (1) the phase where dopants are placed near surfaces (Phase I), (2) the
phase where some dopants start to move into the channel (Phase II), (3) the final phase where donors are distributed quite uniformly (Phase III). (c) The variation in channel
energy plotted as a function of doping phases and cross-section sizes. When nanowires have cross-sections ≥20 nm, the channel energy is reduced as the dopant-distribution
becomes more uniform, while 16 nm channels do not necessarily show a similar pattern, limiting the variation in energy to within 3k_B T at 300 K.
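The two percentages quoted for Fig. 6(b) can be related with a back-of-the-envelope check. The sketch below assumes, purely for illustration, that both the ∼30% and the ∼52% figures are fractional reductions in wall-time and that only the matrix–vector multiplication time changes with Load_MC. Under those assumptions it estimates what share of the baseline wall-time the multiplication would occupy. The function name and this reading of the percentages are assumptions, not statements from the measurements.

```python
def implied_component_share(total_reduction, component_reduction):
    """Share of the baseline wall-time taken by the accelerated component.

    Assumes total_reduction = share * component_reduction, i.e. every other
    component (Comm, VVDot, Others) keeps the same wall-time.
    """
    if not 0.0 < component_reduction <= 1.0:
        raise ValueError("component_reduction must be in (0, 1]")
    return total_reduction / component_reduction


# Figures quoted in the text, read as fractional wall-time reductions (assumption).
share = implied_component_share(total_reduction=0.30, component_reduction=0.52)
print(f"implied MVMul share of baseline wall-time: {share:.2f}")
```

If the percentages are instead meant as throughput gains, the same relation does not apply directly and the estimate would change.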
It should be noted that the exact reason why the performance of
matrix–vector multiplication is maximized with a Load_MC of 65%
has not yet been clearly understood. With a simple mathematical
process presented in the Supplementary material (see
Appendix A), the ceiling of the optimal Load_MC value can be
estimated to be ∼86% from the theoretical performances of the multi-core
and many-core processors of our testbed. However, we believe that
the actual value would be lower (as measured to be ∼65%) since
the wall-time of multiplication used in this work includes the time
of data-transfer between the host (multi-core processors) and Xeon
Phi™ coprocessor cards. We also expect that the performance of
(sparse) matrix–vector multiplication would be affected by the
data-locality (cache-miss).
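One common way to reason about such a ceiling is a simple load-balance model, sketched below. It is not necessarily the derivation given in the Supplementary material. The model assumes that host and coprocessor work concurrently on their shares of the multiplication, so the wall-time is the larger of the two times and is minimal when both finish together. The function names, the relative throughputs and the transfer rate are all hypothetical; the 6:1 throughput ratio is chosen only because it gives a ceiling near 0.86 and is not a testbed specification.

```python
def optimal_offload_fraction(host_rate, mic_rate, transfer_rate=float("inf")):
    """Offloaded fraction at which host and coprocessor finish together.

    Rates are in work units per second. transfer_rate charges host-coprocessor
    data movement to the offloaded share; infinity means transfer is free.
    """
    host_cost = 1.0 / host_rate
    mic_cost = 1.0 / mic_rate + 1.0 / transfer_rate
    return host_cost / (host_cost + mic_cost)


def wall_time(fraction, host_rate, mic_rate, transfer_rate=float("inf")):
    """Wall-time per unit work when both sides run concurrently."""
    host_time = (1.0 - fraction) / host_rate
    mic_time = fraction * (1.0 / mic_rate + 1.0 / transfer_rate)
    return max(host_time, mic_time)


# Hypothetical relative throughputs, for illustration only.
ceiling = optimal_offload_fraction(host_rate=1.0, mic_rate=6.0)
with_transfer = optimal_offload_fraction(host_rate=1.0, mic_rate=6.0,
                                         transfer_rate=3.0)
print(f"ceiling without transfer cost: {ceiling:.2f}")
print(f"optimum with transfer cost:    {with_transfer:.2f}")
```

In this toy model any finite transfer rate pushes the optimum below the ceiling, which is consistent in direction with the explanation offered above; it does not account for cache effects.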
3.3. Utility of Q-AND: Dopant-incorporation in realistically sized Si
nanowires
The dopant-distribution in nanostructures causes remarkable
fluctuations in material properties, and thus has potential as a
key control factor for device engineering [3,38–40]. In spite of
many experimental efforts, dopant-incorporation remains one of the
unresolved issues in nanoscale device engineering since there are
fundamental difficulties in doping nanostructures [41,42], and it is
not easy to get uniformly doped nanowires [43–45], which would
be important for designing interconnects with doped nanostructures.
Xie et al. have recently examined distributions of P dopants in
free-standing highly doped Si nanowires and demonstrated the unique
dopant rearrangement characteristic decoupled from the growth
process, with a strong experimental message that the distribution
of P dopants would generally be hard to make uniform in channels of
cross-sections smaller than ∼22 nm [39]. While the experimental
result can attract researchers who are keen on building devices
with doped nanowires, the relation between channel-sizes and
dopant-distributions still needs to be understood from a theoretical
perspective, particularly including the effect of dopant disorder
that is fundamentally hard to avoid during lithographical
processes. The utility of the Q-AND package is demonstrated here
with a review of the recent modelling study [30], which has been
conducted to understand the experimental observation of
dopant-incorporation in Si nanowires [39].
As illustrated in Fig. 7(a), [110] circular Si:P nanowires are
described with supercells assuming a periodic boundary condition
along the transport direction, and are assumed to be doped
with an average density of ∼2 × 10¹⁹ cm⁻³. For nanowire channels
of 16/20/24 nm cross-sections, bandstructures in charge-neutrality
are computed self-consistently with the Q-AND package.
Despite being assumed to be periodic, nanowire supercells
have 7704/12,056/17,338 atoms depending on the diameter.
Real-space domains of all the supercells are decomposed
with 8 cores (2 cores for decomposition along the [110] direction
(x-direction); 4 cores for decomposition of channel cross-sections
(y–z planes)). To investigate how a channel-size and a
dopant-distribution are correlated, we have set up simulations with 9
nanowire supercells as shown in Fig. 7(b). For each of 3 different
channel-sizes (diameter of 16/20/24 nm), we consider 3 phases of
dopant-distributions: (1) the phase where dopants are placed near
surfaces (Phase I), (2) the phase where some dopants start to move
into the channel (Phase II), (3) the final phase where donors are
distributed quite uniformly as they are in Si bulk (Phase III). Fig. 7(c)
shows that, for nanowires of cross-sections ≥20 nm, the channel
energy, which is defined as the electronic contribution to the total
energy and is quantified with the energy of occupied sub-bands, is
reduced as the dopant-distribution becomes more uniform (Phase
I → Phase III). When channels have a cross-section of 16 nm, however,
more uniform doping does not necessarily reduce the energy,
placing all the variation in energy within 3k_B T (∼77 meV at 300 K).
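The thermal scale used as the threshold above can be checked with a few lines of arithmetic. The sketch below computes 3k_B T at 300 K and tests whether a set of channel energies lies within that window. The helper names and the example energies are hypothetical and are not values read from Fig. 7(c); only the Boltzmann constant is a fixed physical value.

```python
BOLTZMANN_EV_PER_K = 8.617333262e-5  # k_B in eV/K (CODATA 2018)


def thermal_window_mev(temperature_k, multiple=3.0):
    """Return multiple * k_B * T in meV."""
    return multiple * BOLTZMANN_EV_PER_K * temperature_k * 1e3


def within_thermal_window(energies_mev, temperature_k=300.0, multiple=3.0):
    """True if the spread of the given energies does not exceed the window."""
    spread = max(energies_mev) - min(energies_mev)
    return spread <= thermal_window_mev(temperature_k, multiple)


window = thermal_window_mev(300.0)
print(f"3 k_B T at 300 K = {window:.1f} meV")

# Hypothetical relative channel energies (meV) for Phases I-III.
example_energies = [0.0, 40.0, 25.0]
print(within_thermal_window(example_energies))
```

The computed window is about 77.6 meV, in line with the ∼77 meV quoted in the text.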
The result in Fig. 7(c) delivers a non-trivial message that
∼20 nm would be the minimal cross-section of highly doped
nanowires where surface donors would necessarily want to be
distributed uniformly, since the uniform doping reduces the
channel energy, increasing the system-stability. Effects of random
dopant placements on this core message are explored with
additional simulations, where 20 cases of dopant placements are
generated randomly for each doping phase in 2× longer supercells
of 16/20 nm cross-sections with the strategy elaborated in Fig. 8(a)
and (b). Fig. 8(c) shows how the mean of the channel energy
varies with the dopant-distribution, where error-bars (standard
deviation) are added to show the fluctuation of the energy
stemming from random dopant placements. Here, we observe that
the mean value of the energy in 16 nm channels is not necessarily
reduced as the dopant-distribution becomes more uniform (Phase
I → Phase III), while a clear reduction is found in 20 nm channels.
Even in 16 nm channels, more uniform doping may reduce
the channel energy if error-bars are considered. The reduction,
however, becomes smaller than 3k_B T, so we claim that the main
message, that surface donors would not necessarily be distributed
uniformly in channels thinner than 20 nm, is still valid, presenting
a strong connection to the experimental result [39], which finds
∼22 nm to be the size-limit at a doping density of ∼2 × 10²⁰ cm⁻³.

Fig. 8. Effects of random dopant placement on channel energy. (a) Geometry of nanowire supercells considered for simulations with random dopant placements. We
simulated nanowires of 16/20 nm channel cross-sections, but with supercells that are 2× longer along the transport direction to study effects of random dopant placements
on the channel energy variation. (b) Steps to generate random dopant-distributions. Assuming the 1st slab of a supercell has the same dopant-distribution as the one shown
in Fig. 7(b), we created dopant-distributions on the 2nd slab with the following two processes: (i) rotate the dopant-distribution on the 1st slab by 30°–330° on the (001)
plane, and (ii) place each dopant within a circle (radius = 1 nm) whose central point is the dopant-position determined by (i). (c) Energy variation with error-bars plotted
with random dopant placements. 20 dopant-distributions are simulated for each doping phase with a random generation of 4 rotation angles and 5 dopant placements. In
16 nm channels, more uniform doping does not necessarily reduce the mean value of the channel energy. While more uniform doping (Phase I → Phase III) may reduce the
channel energy if error-bars are considered, the reduction turns out to be smaller than 3k_B T.
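The two-step recipe described for Fig. 8(b), a rotation of the first-slab dopant pattern followed by a displacement of each dopant within a 1 nm circle, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in Q-AND: dopants are treated as continuous 2D cross-section coordinates instead of lattice sites, rotation angles are drawn uniformly from 30° to 330°, displacements are uniform over the disk area, and no check is made that a dopant stays inside the wire. All function names and the example positions are hypothetical.

```python
import math
import random


def rotate(points, angle_deg):
    """Rotate 2D cross-section coordinates (nm) about the wire axis."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    return [(c * y - s * z, s * y + c * z) for y, z in points]


def jitter_in_disk(point, radius_nm, rng):
    """Pick a position uniformly inside a disk centred on `point`."""
    r = radius_nm * math.sqrt(rng.random())  # sqrt gives uniform area density
    phi = 2.0 * math.pi * rng.random()
    return (point[0] + r * math.cos(phi), point[1] + r * math.sin(phi))


def second_slab_distributions(first_slab, n_angles=4, n_placements=5,
                              radius_nm=1.0, seed=0):
    """Generate n_angles * n_placements dopant patterns for the 2nd slab."""
    rng = random.Random(seed)
    configs = []
    for _ in range(n_angles):
        angle = rng.uniform(30.0, 330.0)       # step (i): rotation
        rotated = rotate(first_slab, angle)
        for _ in range(n_placements):          # step (ii): local displacement
            configs.append([jitter_in_disk(p, radius_nm, rng) for p in rotated])
    return configs


# Hypothetical first-slab dopant positions (nm) in the cross-section plane.
first_slab = [(7.0, 0.0), (0.0, 7.5), (-6.5, 2.0)]
configs = second_slab_distributions(first_slab)
print(len(configs))
```

With 4 rotation angles and 5 placements per angle, the sketch yields the 20 dopant-distributions per doping phase mentioned in the text. A production version would additionally map each position onto a valid Si lattice site.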
4. Conclusion
A new in-house technology-computer-aided-design (TCAD)
tool for tight-binding simulations of electronic structures is
introduced. Named Quantum simulation tool for Advanced
Nanoscale Devices (Q-AND), the tool describes multi-million
atomic structures and simulates electronic structures with the aid of
high performance computing (HPC). Based on a multi-dimensional
domain decomposition with Message Passing Interface (MPI) and
OpenMP, Q-AND shows good strong and weak scalability on
HPC clusters of a multi-core base, indicating the capability of
extremely large-scale simulations. Another remarkable and unique
strength of Q-AND compared to previously known tight-binding
simulation tools is its adaptability to the latest clusters of a
coprocessor (many-core) base, where it shows an enhancement in
computing performance of up to ∼30% on clusters of Intel Xeon Phi™
coprocessors, compared to the case where only multi-core
processors are used. The scheme for a hybrid decomposition of
simulation domains with MPI and OpenMP, iterative solvers developed
for Schrödinger–Poisson simulations, and the strategy to enhance
the performance with many-core computing in an asynchronous
offload mode are discussed to deliver a practical guideline
for TCAD development to researchers in the field of computational
nanoelectronics. The utility of Q-AND for research is
demonstrated with a modelling study that has recently been conducted
to clarify experimentally observed dopant-incorporation in highly
phosphorus-doped silicon nanowires that have realistically sized
cross-sections. Having been developed via the Intel Parallel
Computing Center project, Q-AND will soon be open to the public, to establish
a sound framework for modelling research with many-core
computing in the field of nanoelectronics.
Acknowledgements
This work has been carried out with support from the
Intel Parallel Computing Center (IPCC) project funded by Intel
Corporation, USA, and with the extensive use of the Tachyon-II high
performance computing resource supported by Korea Institute of
Science and Technology Information, Republic of Korea. Hoon Ryu
would like to thank Jeehye Sohn for all the invaluable support and
encouragement for research.
Appendix A. Supplementary material
Supplementary material related to this article can be found
online at http://dx.doi.org/10.1016/j.cpc.2016.08.015.
References
[1] F.J. Ruess, L. Oberbeck, M.Y. Simmons, K.E.J. Goh, A.R. Hamilton, T. Hallam, S.R.
Schofield, N.J. Curson, R.G. Clark, Nano Lett. 4 (10) (2004) 1969–1973.
[2] B. Weber, S. Mahapatra, H. Ryu, S. Lee, A. Fuhrer, T.C.G. Reusch, D.L. Thompson,
W.C.T. Lee, G. Klimeck, L.C.L. Hollenberg, M.Y. Simmons, Science 335 (2012)
64–67.
[3] H. Ryu, S. Lee, B. Weber, S. Mahapatra, L.C.L. Hollenberg, M.Y. Simmons,
G. Klimeck, Nanoscale 5 (18) (2013) 8666–8674.
[4] M. Fuechsle, J.A. Miwa, S. Mahapatra, H. Ryu, S. Lee, O. Warschkow, L.C.L.
Hollenberg, G. Klimeck, M.Y. Simmons, Nature Nanotechnology 7 (2012)
242–246.
[5] H. Ryu, S. Lee, M. Fuechsle, J.A. Miwa, S. Mahapatra, L.C.L. Hollenberg,
M.Y. Simmons, G. Klimeck, Small 11 (3) (2015) 374–381.
[6] N. Argaman, G. Makov, Amer. J. Phys. 68 (2000) 69.
[7] Vienna ab initio simulation package (vasp), https://www.vasp.at.
[8] G.M. Dalpian, J.R. Chelikowsky, Phys. Rev. Lett. 96 (5) (2006) 226802.
[9] K.-H. Hong, J. Kim, J.H. Lee, J. Shin, U.-I. Chung, Nano Lett. 10 (5) (2010)
1671–1676.
[10] S. Kim, J.-S. Park, K.J. Chang, Nano Lett. 12 (10) (2012) 5068–5073.
[11] M.-V. Fernandez-Serra, C. Adessi, X. Blase, Nano Lett. 6 (12) (2006) 2674–2678.
[12] D.W. Drumm, J.S. Smith, M.C. Per, A. Budi, L.C.L. Hollenberg, S.P. Russo, Phys.
Rev. Lett. 110 (2013) 126802.
[13] J. Wang, A. Rahman, A. Ghosh, G. Klimeck, M. Lundstrom, IEEE Trans. Electron
Devices 52 (7) (2005) 1589–1595.
[14] Y.X. Liu, D.Z.Y. Ting, T.C. McGill, Phys. Rev. B 54 (1996) 5675.
[15] J.-M. Jancu, R. Scholz, F. Beltram, F. Bassani, Phys. Rev. B 57 (1998) 6493.
[16] M. Luisier, A. Schenk, W. Fichtner, Phys. Rev. B 74 (2006) 205323.
[17] D.J. Paul, Phys. Rev. B 77 (2008) 155323.
[18] T.B. Boykin, G. Klimeck, F. Oyafuso, Phys. Rev. B 69 (2004) 115201.
[19] G. Klimeck, S. Ahmed, H. Bae, N. Kharche, R. Rahman, S. Clark, B. Haley, S. Lee,
M. Naumov, H. Ryu, F. Saied, M. Prada, M. Korkusinski, T. Boykin, IEEE Trans.
Electron Devices 54 (9) (2007) 2079–2089.
[20] S. Lee, H. Ryu, Z. Jiang, G. Klimeck, Proceedings of IEEE International Workshop
on Computational Electronics, IWCE, 2013, pp. 1–4. http://dx.doi.org/10.1109/IWCE.2009.5091117.
[21] S. Steiger, M. Povolotskyi, H.H. Park, T. Kubis, G. Klimeck, IEEE Trans.
Nanotechnology 10 (6) (2011) 1464–1474.
[22] N. Kharche, M. Prada, T.B. Boykin, G. Klimeck, Appl. Phys. Lett. 90 (9) (2007)
092109.
[23] G.P. Lansbergen, R. Rahman, C.J. Wellard, I. Woo, J. Caro, N. Collaert, S.
Biesemans, G. Klimeck, L.C.L. Hollenberg, S. Rogge, Nat. Phys. 4 (2008)
656–661.
[24] M. Usman, H. Ryu, I. Woo, D.S. Ebert, G. Klimeck, IEEE Trans. Nanotechnology
8 (3) (2009) 330–344.
[25] B. Weber, Y.H.M. Tan, S. Mahapatra, T.F. Watson, H. Ryu, R. Rahman, G.K.L.C.L.
Hollenberg, M.Y. Simmons, Nature Nanotechnology 9 (2014) 430–435.
[26] The scalable library for eigenvalue problem computations (slepc),
http://slepc.upv.es.
[27] Intel Parallel Computing Center at Korea Institute of Science and Technology Information,
https://software.intel.com/en-us/articles/intel-parallel-computing-center-at-kisti.
[28] Intel Xeon Phi™ coprocessor, https://software.intel.com/en-us/mic-developer.
[29] Top 500 list: Tachyon-ii high performance computing cluster,
http://www.top500.org/system/176727.
[30] H. Ryu, J. Kim, K.H. Hong, Nano Lett. 15 (1) (2015) 450–456.
[31] A. Buluç, J.T. Fineman, M. Frigo, J.R. Gilbert, C.E. Leiserson, Proceedings of
the annual Symposium on Parallelism in Algorithms and Architectures, SPAA,
2009, pp. 233–244. http://dx.doi.org/10.1145/1583991.1584053.
[32] E. Gawlinski, T. Dzurak, R.A. Tahir-Kheli, J. Appl. Phys. 72 (8) (1992) 3562.
[33] C. Lanczos, J. Res. Natl. Bur. Stand. 45 (4) (1950) 255–282.
[34] M.R. Hestenes, E. Stiefel, J. Res. Natl. Bur. Stand. 49 (6) (1952) 409–436.
[35] Intel Knights Corner Xeon Phi™ coprocessors, http://ark.intel.com/products/codename/57721/Knights-Corner.
[36] Intel developer zone: Asynchronous offload - c++ code example,
https://software.intel.com/en-us/articles/asynchronous-offload-c-code-examples.
[37] Top 10 hpc sites as of november 2015, http://www.top500.org/lists/2015/11/.
[38] T. Shinada, S. Okamoto, T. Kobayashi, I. Ohdomari, Nature 437 (2005)
1128–1131.
[39] P. Xie, Y. Hu, J. Huang, C.M. Lieber, Proc. Natl. Acad. Sci. 106 (36) (2009)
15254–15258.
[40] J. Han, T.-L. Chan, J.R. Chelikowsky, Phys. Rev. B 82 (2010) 153413.
[41] M.T. Björk, H. Schmid, J. Knoch, H. Riel, W. Riess, Nature Nanotechnology 4
(2009) 103–107.
[42] D.J. Norris, A.L. Efros, S.C. Erwin, Science 319 (2008) 1776–1779.
[43] D.E. Perea, E.R. Hemesath, E.J. Schwalbach, J.L. Lensch-Falk, P.W. Voorhees,
L.J. Lauhon, Nature Nanotechnology 4 (2009) 315–319.
[44] E. Koren, N. Berkovitch, Y. Rosenwaks, Nano Lett. 10 (4) (2010) 1163–1167.
[45] E. Koren, J.K. Hyun, U. Givan, E.R. Hemesath, L.J. Lauhon, Y. Rosenwaks, Nano
Lett. 11 (1) (2011) 183–187.
