QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Previous studies have reported that common dense linear algebra operations do
not achieve speed up by using multiple geographical sites of a computational
grid. Because such operations are the building blocks of most scientific
applications, conventional supercomputers are still strongly predominant in
high-performance computing and the use of grids for speeding up large-scale
scientific problems is limited to applications exhibiting parallelism at a
higher level. We have identified two performance bottlenecks in the distributed
memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear
algebra library. First, because ScaLAPACK assumes a homogeneous communication
network, the implementations of ScaLAPACK algorithms lack locality in their
communication pattern. Second, the number of messages sent in the ScaLAPACK
algorithms is significantly greater than in other algorithms that trade flops for
communication. In this paper, we present a new approach for computing a QR
factorization -- one of the main dense linear algebra kernels -- of tall and
skinny matrices in a grid computing environment that overcomes these two
bottlenecks. Our contribution is to combine a recently proposed algorithm
(Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in
order to confine intensive communications (ScaLAPACK calls) within the
different geographical sites. An experimental study conducted on the Grid'5000
platform shows that the resulting performance increases linearly with the
number of geographical sites on large-scale problems (and is in particular
consistently higher than ScaLAPACK's).
Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed
Processing Symposium 2010 in Atlanta, GA, USA).
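The communication-avoiding idea in the abstract above can be illustrated with a one-level tall-and-skinny QR reduction. This is a minimal sketch using NumPy, not the paper's implementation: the function name `tsqr_r`, the block count, and the matrix sizes are assumptions chosen for illustration, and each row block merely stands in for one geographical site.

```python
import numpy as np

def tsqr_r(A, n_blocks):
    """Return the R factor of a tall-skinny A via a one-level TSQR reduction.

    Hypothetical helper for illustration; each row block plays the role
    of one site that factors its own rows without talking to the others.
    """
    blocks = np.array_split(A, n_blocks, axis=0)
    # Step 1: every block computes a local QR and keeps only its small R.
    local_rs = [np.linalg.qr(b, mode="r") for b in blocks]
    # Step 2: only the n-by-n R factors are exchanged and factored again.
    stacked = np.vstack(local_rs)
    return np.linalg.qr(stacked, mode="r")

rng = np.random.default_rng(0)
A = rng.standard_normal((4000, 8))      # tall and skinny: 4000 rows, 8 columns
R_tsqr = tsqr_r(A, n_blocks=4)
R_ref = np.linalg.qr(A, mode="r")       # direct factorization for comparison

# R is unique only up to the sign of each row, so compare magnitudes.
max_diff = float(np.max(np.abs(np.abs(R_tsqr) - np.abs(R_ref))))
```

In this sketch the only data crossing block boundaries is one 8-by-8 triangle per block, which is the property that makes the approach attractive when links between sites are slow.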
Introduction to StarNEig -- A Task-based Library for Solving Nonsymmetric Eigenvalue Problems
In this paper, we present the StarNEig library for solving dense
non-symmetric (generalized) eigenvalue problems. The library is built on top of
the StarPU runtime system and targets both shared and distributed memory
machines. Some components of the library support GPUs. The library is currently
in an early beta state and only real arithmetic is supported. Support for
complex data types is planned for a future release. This paper is aimed at
potential users of the library. We describe the design choices and capabilities
of the library, and contrast them to existing software such as ScaLAPACK.
StarNEig implements a ScaLAPACK compatibility layer that should make it easy
for a new user to transition to StarNEig. We demonstrate the performance of the
library with a small set of computational experiments.
Comment: 10 pages, 4 figures (10 when counting sub-figures), 2 tex-files.
Submitted to PPAM 2019, 13th international conference on parallel processing
and applied mathematics, September 8-11, 2019. Proceedings will be published
after the conference by Springer in the LNCS series. Second author's first
name is "Carl Christian" and last name "Kjelgaard Mikkelsen".
HP-DAEMON: High Performance Distributed Adaptive Energy-efficient Matrix-multiplicatiON
Improving the energy efficiency of high performance scientific applications has become a crucial demand. Software-controlled hardware solutions based on Dynamic Voltage and Frequency Scaling (DVFS) have been shown extensively to be effective. Although DVFS benefits green computing, it can itself incur non-negligible overhead when it issues a large number of frequency switches. In this paper, we propose a strategy to achieve the optimal energy savings for distributed matrix multiplication by algorithmically and adaptively trading more computation and communication at a time, within user-specified memory costs, for fewer DVFS switches; this saves 7.5% more energy on average than a classic strategy. Moreover, we leverage a high performance communication scheme that fully exploits network bandwidth via pipeline broadcast. Overall, the integrated approach achieves substantial energy savings (up to 51.4%) and performance gain (28.6% on average) compared to ScaLAPACK pdgemm() on a cluster with an Ethernet switch, and outperforms ScaLAPACK and DPLASMA pdgemm() by 33.3% and 32.7% on average, respectively, on a cluster with an Infiniband switch.
A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization
We present a distributed-memory library for computations with dense
structured matrices. A matrix is considered structured if its off-diagonal
blocks can be approximated by a rank-deficient matrix with low numerical rank.
Here, we use Hierarchically Semi-Separable representations (HSS). Such matrices
appear in many applications, e.g., finite element methods, boundary element
methods, etc. Exploiting this structure allows for fast solution of linear
systems and/or fast computation of matrix-vector products, which are the two
main building blocks of matrix computations. The compression algorithm that we
use, which computes the HSS form of an input dense matrix, relies on randomized
sampling with a novel adaptive sampling mechanism. We discuss the
parallelization of this algorithm and also present the parallelization of
structured matrix-vector product, structured factorization and solution
routines. The efficiency of the approach is demonstrated on large problems from
different academic and industrial applications, on up to 8,000 cores.
This work is part of a more global effort, the STRUMPACK (STRUctured Matrices
PACKage) software package for computations with sparse and dense structured
matrices. Hence, although useful in their own right, the routines also
represent a step in the direction of a distributed-memory sparse solver.
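The randomized compression mentioned in the abstract above can be sketched for a single off-diagonal block. This is only an illustration under stated assumptions: it uses a fixed number of random samples rather than the adaptive sampling mechanism the abstract describes, it does not build an HSS hierarchy, and the function name `randomized_low_rank` and the test kernel are hypothetical.

```python
import numpy as np

def randomized_low_rank(B, rank, oversample=10, seed=0):
    """Approximate B as Q @ W using a Gaussian random sample of its range.

    Hypothetical helper; the sample size is fixed here, whereas an
    adaptive scheme would grow it until a tolerance is met.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((B.shape[1], rank + oversample))
    # Orthonormal basis for the sampled column space of B.
    Q, _ = np.linalg.qr(B @ omega)
    return Q, Q.T @ B

# Off-diagonal block of a smooth kernel between two well-separated point
# sets; such blocks typically have low numerical rank.
x = np.linspace(0.0, 1.0, 200)
y = np.linspace(2.0, 3.0, 200)
B = 1.0 / (1.0 + np.abs(x[:, None] - y[None, :]))

Q, W = randomized_low_rank(B, rank=8)
rel_err = float(np.linalg.norm(B - Q @ W) / np.linalg.norm(B))
```

Here the 200-by-200 block is stored as two factors with 18 columns and 18 rows, which is the kind of saving that structured formats exploit for off-diagonal blocks.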
High-Performance Solvers for Dense Hermitian Eigenproblems
We introduce a new collection of solvers - subsequently called EleMRRR - for
large-scale dense Hermitian eigenproblems. EleMRRR solves various types of
problems: generalized, standard, and tridiagonal eigenproblems. Among these,
the last is of particular importance as it is a solver in its own right, as
well as the computational kernel for the first two; we present a fast and
scalable tridiagonal solver based on the Algorithm of Multiple Relatively
Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers,
PMRRR is part of the freely available Elemental library, and is designed to
fully support both message-passing (MPI) and multithreading parallelism (SMP).
As a result, the solvers can equally be used in pure MPI or in hybrid MPI-SMP
fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's
solvers on two supercomputers. Such a study, performed with up to 8,192 cores,
provides precise guidelines to assemble the fastest solver within the ScaLAPACK
framework; it also indicates that EleMRRR outperforms even the fastest solvers
built from ScaLAPACK's components.
An a posteriori verification method for generalized real-symmetric eigenvalue problems in large-scale electronic state calculations
An a posteriori verification method is proposed for the generalized
real-symmetric eigenvalue problem and is applied to densely clustered
eigenvalue problems in large-scale electronic state calculations. The proposed
method is realized by a two-stage process in which the approximate solution is
computed by existing numerical libraries and is then verified in a moderate
computational time. The procedure returns intervals containing one exact
eigenvalue in each interval. Test calculations were carried out for organic
device materials, and the verification method confirms that all exact
eigenvalues are well separated in the obtained intervals. This verification
method will be integrated into EigenKernel (https://github.com/eigenkernel/),
which is middleware for various parallel solvers for the generalized eigenvalue
problem. Such an a posteriori verification method will be important in future
computational science.
Comment: 15 pages, 7 figures
Fast Parallel Randomized QR with Column Pivoting Algorithms for Reliable Low-rank Matrix Approximations
Factorizing large matrices by QR with column pivoting (QRCP) is substantially
more expensive than QR without pivoting, owing to communication costs required
for pivoting decisions. In contrast, randomized QRCP (RQRCP) algorithms have
proven themselves empirically to be highly competitive with high-performance
implementations of QR in processing time, on uniprocessor and shared memory
machines, and as reliable as QRCP in pivot quality.
We show that RQRCP algorithms can be as reliable as QRCP with failure
probabilities exponentially decaying in oversampling size. We also analyze
efficiency differences among different RQRCP algorithms. More importantly, we
develop distributed memory implementations of RQRCP that are significantly
better than QRCP implementations in ScaLAPACK.
As a further development, we introduce the concept of and develop algorithms
for computing spectrum-revealing QR factorizations for low-rank matrix
approximations, and demonstrate their effectiveness against leading low-rank
approximation methods in both theoretical and numerical reliability and
efficiency.
Comment: 11 pages, 14 figures, accepted by 2017 IEEE 24th International
Conference on High Performance Computing (HiPC), awarded the best paper prize
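The core idea behind randomized QRCP, making pivot decisions on a small random sketch instead of the full matrix, can be sketched as follows. This is a simplified illustration and not one of the paper's algorithms: it is unblocked and serial, it selects pivots greedily by projected column norm, and the names `sketch_pivots`, `oversample`, and the test matrix are assumptions.

```python
import numpy as np

def sketch_pivots(A, k, oversample=8, seed=0):
    """Pick k pivot columns of A by working on a small Gaussian sketch.

    Hypothetical helper: the sketch has only k + oversample rows, so the
    pivot search never touches the full height of A.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((k + oversample, A.shape[0]))
    B = omega @ A                        # (k + oversample) x n compressed matrix
    pivots = []
    for _ in range(k):
        norms = np.linalg.norm(B, axis=0)
        j = int(np.argmax(norms))        # column with the largest remaining norm
        pivots.append(j)
        q = B[:, j] / norms[j]
        B = B - np.outer(q, q @ B)       # project the chosen direction out
    return pivots

# Test matrix: 40 tiny columns, five of which are replaced by O(1) columns.
rng = np.random.default_rng(1)
A = 1e-6 * rng.standard_normal((300, 40))
heavy = [3, 11, 17, 25, 33]
A[:, heavy] = rng.standard_normal((300, 5))

pivots = sketch_pivots(A, k=5)
```

In this sketch all pivot decisions operate on a 13-by-40 matrix rather than the 300-by-40 original, which hints at why the randomized variant needs far less communication per pivot.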