Enhancing the scalability and load balancing of the parallel selected inversion algorithm via tree-based asynchronous communication
We develop a method for improving the parallel scalability of the recently
developed parallel selected inversion algorithm [Jacquelin, Lin and Yang 2014],
named PSelInv, on massively parallel distributed memory machines. In the
PSelInv method, we compute selected elements of the inverse of a sparse matrix
A that can be decomposed as A = LU, where L is lower triangular and U is upper
triangular. Updating these selected elements of $A^{-1}$ requires restricted
collective communications among a subset of processors within each column or
row communication group created by a block cyclic distribution of L and U. We
describe how this type of restricted collective communication can be
implemented by using asynchronous point-to-point MPI communication functions
combined with a binary tree based data propagation scheme. Because multiple
restricted collective communications may take place at the same time in the
parallel selected inversion algorithm, we need to use a heuristic to prevent
processors participating in multiple collective communications from receiving
too many messages. This heuristic allows us to reduce communication load
imbalance and improve the overall scalability of the selected inversion
algorithm. For instance, when 6,400 processors are used, we observe over 5x
speedup for test matrices. It also mitigates the performance variability
introduced by an inhomogeneous network topology.
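The binary-tree data propagation described in this abstract can be illustrated with a minimal Python sketch that simulates one restricted broadcast among a subset of ranks. The heap-style tree layout, the example rank list, and the names `tree_children` and `simulate_broadcast` are assumptions made here for illustration; the abstract does not say how PSelInv builds its trees or how its load-balancing heuristic works.

```python
def tree_children(ranks, me):
    """Return the ranks that `me` forwards the message to.

    `ranks` lists the participants of one restricted collective, root first.
    The tree is laid out like a binary heap: position i sends to 2i+1 and 2i+2.
    (Hypothetical layout, chosen only for this sketch.)
    """
    i = ranks.index(me)
    return [ranks[j] for j in (2 * i + 1, 2 * i + 2) if j < len(ranks)]


def simulate_broadcast(ranks):
    """Propagate a message from ranks[0]; return {rank: number of hops from the root}."""
    depth = {ranks[0]: 0}
    frontier = [ranks[0]]
    while frontier:
        next_frontier = []
        for sender in frontier:
            for receiver in tree_children(ranks, sender):
                depth[receiver] = depth[sender] + 1
                next_frontier.append(receiver)
        frontier = next_frontier
    return depth


# Six participating ranks inside one (hypothetical) column communication group.
participants = [3, 7, 11, 15, 19, 23]
hops = simulate_broadcast(participants)
```

With six participants the root sends two messages instead of five, and every rank is reached within two forwarding steps. In an MPI code each forwarding step would presumably be issued with nonblocking calls such as `MPI_Isend` and `MPI_Irecv`, so that several such broadcasts can overlap.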

A Left-Looking Selected Inversion Algorithm and Task Parallelism on Shared Memory Systems
Given a sparse matrix $A$, the selected inversion algorithm is an efficient
method for computing certain selected elements of $A^{-1}$. These selected
elements correspond to all or some nonzero elements of the LU factors of $A$.
In many ways, the type of matrix updates performed in the selected inversion
algorithm is similar to that performed in the LU factorization, although the
sequence of operations is different. In the context of LU factorization, it is
known that the left-looking and right-looking algorithms exhibit different
memory access and data communication patterns, and hence different behavior on
shared memory and distributed memory parallel machines. Corresponding to
right-looking and left-looking LU factorization, the selected inversion algorithm
can be organized as a left-looking and a right-looking algorithm. The parallel
right-looking version of the algorithm has been developed in [1]. The sequence
of operations performed in this version of the selected inversion algorithm is
similar to those performed in a left-looking LU factorization algorithm. In
this paper, we describe the left-looking variant of the selected inversion
algorithm and, based on a task-parallel method, present an efficient
implementation of the algorithm for shared memory machines. We demonstrate that
with the task scheduling features provided by OpenMP 4.0, the left-looking
selected inversion algorithm can scale well both on the Intel Haswell multicore
architecture and on the Intel Knights Corner (KNC) manycore architecture.
Compared to the right-looking selected inversion algorithm, the left-looking
formulation facilitates pipelining of work along different branches of the
elimination tree, and can be a promising candidate for future development of
massively parallel selected inversion algorithms on heterogeneous architectures.
Comment: 9 pages, 7 figures, submitted to SuperComputing 201
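To make the term "left-looking" concrete, here is a minimal dense sketch of a left-looking LU factorization in Python. It is not the selected inversion algorithm of the paper: it omits pivoting, sparsity, and parallel tasks, and the function name `left_looking_lu` is an assumption made for illustration. It only shows the access pattern in which column j is finalized by reading the already finished columns to its left.

```python
def left_looking_lu(a):
    """Factor a square matrix as a = L * U, one column at a time.

    No pivoting is performed, so every leading principal minor of `a`
    must be nonzero (an assumption of this sketch).
    """
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Working copy of column j of the input matrix.
        col = [float(a[i][j]) for i in range(n)]
        # "Look left": apply the updates from every finished column k < j.
        for k in range(j):
            # At this point col[k] already equals U[k][j].
            for i in range(k + 1, n):
                col[i] -= L[i][k] * col[k]
        # The upper part of the updated column is column j of U.
        for i in range(j + 1):
            U[i][j] = col[i]
        # The lower part, scaled by the pivot, is column j of L.
        for i in range(j + 1, n):
            L[i][j] = col[i] / U[j][j]
    return L, U


L, U = left_looking_lu([[4.0, 3.0], [6.0, 3.0]])
```

A right-looking variant would instead update all columns to the right as soon as column k is finished; the arithmetic is the same, but the order of memory accesses differs, which is the distinction the abstract draws.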

Robust Determination of the Chemical Potential in the Pole Expansion and Selected Inversion Method for Solving Kohn-Sham density functional theory
Fermi operator expansion (FOE) methods are powerful alternatives to
diagonalization type methods for solving Kohn-Sham density functional theory
(KSDFT). One example is the pole expansion and selected inversion (PEXSI)
method, which approximates the Fermi operator by rational matrix functions and
reduces the computational complexity to at most quadratic scaling for solving
KSDFT. Unlike diagonalization type methods, the chemical potential often cannot
be directly read off from the result of a single step of evaluation of the
Fermi operator. Hence multiple evaluations must be performed sequentially
to find the chemical potential that yields the correct number of
electrons within a given tolerance. This hinders the performance of FOE methods
in practice. In this paper we develop an efficient and robust strategy to
determine the chemical potential in the context of the PEXSI method. The main
idea of the new method is not to find the exact chemical potential at each
self-consistent-field (SCF) iteration, but to dynamically and
rigorously update the upper and lower bounds for the true chemical potential,
so that the chemical potential converges along with the SCF iteration.
Instead of evaluating the Fermi operator multiple times sequentially, our
method uses a two-level strategy that evaluates the Fermi operators in
parallel. In the regime of full parallelization, the wall clock time of each
SCF iteration is always close to the time for one single evaluation of the
Fermi operator, even when the initial guess is far away from the converged
solution. We demonstrate the effectiveness of the new method using examples
with metallic and insulating characters, as well as results from ab initio
molecular dynamics.
Comment: Submitted to Journal of Chemical Physics
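The idea of tightening upper and lower bounds on the chemical potential from several independent evaluations can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the electron count is computed from explicit eigenvalues with a Fermi-Dirac occupation (one electron per state), which stands in for a PEXSI evaluation of the Fermi operator, and the names `electron_count`, `refine_bracket`, and `find_mu` are hypothetical. The abstract does not describe the paper's actual bound-update rule.

```python
import math


def electron_count(eigs, mu, beta):
    """Electron number at chemical potential mu (Fermi-Dirac, one electron per state)."""
    return sum(1.0 / (1.0 + math.exp(beta * (e - mu))) for e in eigs)


def refine_bracket(eigs, n_target, lo, hi, beta, n_points=8):
    """Shrink [lo, hi] using n_points trial values of mu.

    The trial evaluations are independent of each other, so in a real code
    they could run in parallel; here they run in a plain loop.
    """
    step = (hi - lo) / (n_points - 1)
    trial = [lo + k * step for k in range(n_points)]
    counts = [electron_count(eigs, mu, beta) for mu in trial]
    for k in range(n_points - 1):
        # The electron count increases with mu, so the target lies in one sub-interval.
        if counts[k] <= n_target <= counts[k + 1]:
            return trial[k], trial[k + 1]
    raise ValueError("target electron count is not bracketed by [lo, hi]")


def find_mu(eigs, n_target, lo, hi, beta, tol=1e-8):
    """Repeat the bracket refinement until the bounds agree to within tol."""
    while hi - lo > tol:
        lo, hi = refine_bracket(eigs, n_target, lo, hi, beta)
    return 0.5 * (lo + hi)


# Toy spectrum (hypothetical values) with two electrons.
mu = find_mu([-1.0, 0.0, 1.0, 2.0], n_target=2.0, lo=-2.0, hi=3.0, beta=10.0)
```

Each refinement costs the wall clock time of a single evaluation when the trial points are computed concurrently, and it shrinks the bracket by a factor of `n_points - 1`, which mirrors the trade-off the abstract describes.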

PEXSI-$\Sigma$: A Green's function embedding method for Kohn-Sham density functional theory
In this paper, we propose a new Green's function embedding method called
PEXSI-$\Sigma$ for describing complex systems within the Kohn-Sham density
functional theory (KSDFT) framework, after revisiting the physics literature of
Green's function embedding methods from a numerical linear algebra perspective.
The PEXSI-$\Sigma$ method approximates the density matrix using a set of nearly
optimally chosen Green's functions evaluated at complex frequencies. For each
Green's function, the complex boundary conditions are described by a self
energy matrix $\Sigma$ constructed from a physical reference Green's function,
which can be computed relatively easily. In the linear regime, such treatment
of the boundary condition can be numerically exact. The support of the
$\Sigma$ matrix is restricted to degrees of freedom near the boundary of the computational
domain, and can be interpreted as a frequency dependent surface potential. This
makes it possible to perform KSDFT calculations with $\mathcal{O}(N^2)$
computational complexity, where $N$ is the number of atoms within the
computational domain. Green's function embedding methods are also naturally
compatible with atomistic Green's function methods for relaxing the atomic
configuration outside the computational domain. As a proof of concept, we
demonstrate the accuracy of the PEXSI-$\Sigma$ method for graphene with
divacancy and dislocation-dipole types of defects using the DFTB+ software
package.
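The role of a self energy matrix as a boundary condition can be made explicit with the standard Schur-complement identity for a linear problem. The notation below is an assumption introduced for illustration and is not taken from the abstract: $H$ and $S$ are the Hamiltonian and overlap matrices, $\Omega$ labels the computational domain, $E$ labels the exterior, and $z$ is a complex frequency. The abstract builds $\Sigma$ from a reference Green's function instead, so this is only a sketch of why such a treatment can be exact in the linear regime.

```latex
% Full Green's function of the linear problem
G(z) = (zS - H)^{-1}

% Its block on the computational domain, written with a self energy
G_{\Omega\Omega}(z) = \bigl( z S_{\Omega\Omega} - H_{\Omega\Omega} - \Sigma(z) \bigr)^{-1}

% Self energy as a Schur complement over the exterior degrees of freedom
\Sigma(z) = ( z S_{\Omega E} - H_{\Omega E} )
            ( z S_{E E} - H_{E E} )^{-1}
            ( z S_{E \Omega} - H_{E \Omega} )
```

Because the coupling blocks $zS_{\Omega E} - H_{\Omega E}$ are nonzero only for degrees of freedom in $\Omega$ that interact with the exterior, $\Sigma(z)$ is supported near the boundary of the domain and depends on the frequency $z$, consistent with its interpretation as a frequency dependent surface potential.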