An Efficient Algorithm For Simulating Fracture Using Large Fuse Networks
The high computational cost of simulating progressive fracture with large
discrete lattice networks stems from the requirement to
solve {\it a new large set of linear equations} every time a lattice bond
is broken. To address this problem, we propose an algorithm that combines the
multiple-rank sparse Cholesky downdating algorithm with the rank-p inverse
updating algorithm based on the Sherman-Morrison-Woodbury formula for the
simulation of progressive fracture in disordered quasi-brittle materials using
discrete lattice networks. Using the present algorithm, the computational
complexity of solving the new set of linear equations after breaking a bond
reduces to the same order as that of a simple {\it backsolve} (forward
elimination and backward substitution) {\it using the already LU factored
matrix}. That is, the computational cost is $O(nnz({\bf L}))$, where $nnz({\bf L})$ denotes the number of non-zeros of the Cholesky factorization ${\bf L}$ of
the stiffness matrix ${\bf A}$. This algorithm using the direct sparse solver
is faster than the Fourier accelerated preconditioned conjugate gradient (PCG)
iterative solvers, and eliminates the {\it critical slowing down} associated
with the iterative solvers that is especially severe close to the critical
points. Numerical results using random resistor networks substantiate the
efficiency of the present algorithm. Comment: 15 pages including 1 figure. On page 11407 of the original paper
(J. Phys. A: Math. Gen. 36 (2003) 11403-11412), Eqs. 11 and 12 were
misprinted, which went unnoticed during the proofreading stage.
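The low-rank updating idea can be illustrated in its rank-1 form. The following is a minimal NumPy sketch, not the paper's implementation, assuming the broken bond contributes a rank-1 term $c\,ee^T$ to the stiffness matrix; the Sherman-Morrison formula then reuses the cached Cholesky factor so that updating the solution costs only backsolves:

```python
import numpy as np

# Minimal sketch (not the paper's code), assuming the broken bond removes a
# rank-1 term from the stiffness matrix: A_new = A - c * e e^T, where e is
# the bond's signed incidence vector and c its conductance.
rng = np.random.default_rng(0)
n = 6
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)          # SPD stand-in for the stiffness matrix
b = rng.standard_normal(n)

L = np.linalg.cholesky(A)            # factor once (the expensive step)

def backsolve(L, rhs):
    """Forward elimination + backward substitution with the cached factor."""
    return np.linalg.solve(L.T, np.linalg.solve(L, rhs))

e = np.zeros(n)
e[0], e[1] = 1.0, -1.0               # incidence vector of the broken bond
c = 0.5                              # conductance of the broken bond

# Sherman-Morrison: (A - c e e^T)^{-1} b
#   = A^{-1} b + c * (A^{-1} e) (e^T A^{-1} b) / (1 - c * e^T A^{-1} e)
x_old = backsolve(L, b)
w = backsolve(L, e)                  # one extra backsolve, no refactorization
x_new = x_old + c * w * (e @ x_old) / (1.0 - c * (e @ w))

# agrees with a full re-solve against the updated matrix
assert np.allclose(x_new, np.linalg.solve(A - c * np.outer(e, e), b))
```

Each subsequent bond break thus costs two backsolves rather than a fresh factorization, which is the source of the speedup claimed above.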
Minimizing Communication in Linear Algebra
In 1981 Hong and Kung proved a lower bound on the amount of communication
needed to perform dense matrix multiplication using the conventional
algorithm, where the input matrices were too large to fit in the small, fast
memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and
extended it to the parallel case. In both cases the lower bound may be
expressed as $\Omega$(#arithmetic operations / $\sqrt{M}$), where M is the size
of the fast memory (or local memory in the parallel case). Here we generalize
these results to a much wider variety of algorithms, including LU
factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization,
algorithms for eigenvalues and singular values, i.e., essentially all direct
methods of linear algebra. The proof works for dense or sparse matrices, and
for sequential or parallel algorithms. In addition to lower bounds on the
amount of data moved (bandwidth) we get lower bounds on the number of messages
required to move it (latency). We illustrate how to extend our lower bound
technique to compositions of linear algebra operations (like computing powers
of a matrix), to decide whether it is enough to call a sequence of simpler
optimal algorithms (like matrix multiplication) to minimize communication, or
if we can do better. We give examples of both. We also show how to extend our
lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR,
eigenvalue and the SVD problems that attain these lower bounds; implementations
of LU and QR show large speedups over conventional linear algebra algorithms in
standard libraries like LAPACK and ScaLAPACK. Many open problems remain. Comment: 27 pages, 2 tables
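The bound can be made concrete with a small sketch. The following assumes a simple two-level memory model (an illustrative assumption, not the paper's formal model): tiled matrix multiplication with b x b tiles moves on the order of n^3 / b words between slow and fast memory, so choosing b proportional to the square root of M attains the Omega(#arithmetic operations / sqrt(M)) scaling:

```python
import numpy as np

# Illustrative sketch under an assumed two-level memory model: tiled matrix
# multiplication with b x b tiles moves about 4 * n^3 / b modeled words, so
# picking b ~ sqrt(M/3) (three tiles resident at once) matches the
# Omega(#ops / sqrt(M)) lower bound up to a constant.

def tiled_matmul(A, B, b):
    n = A.shape[0]
    C = np.zeros_like(A)
    words_moved = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # model cost: load tiles of A, B, C and store the C tile back
                words_moved += 4 * b * b
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C, words_moved

M = 3 * 16 * 16                  # fast memory holds three 16 x 16 tiles
b = int((M // 3) ** 0.5)         # b = 16
n = 64
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
C, moved = tiled_matmul(A, B, b)
assert np.allclose(C, A @ B)     # same product, fewer words moved per flop
```

The counter confirms the modeled traffic is 4 n^3 / b words; shrinking the fast memory (smaller b) inflates it, exactly the trade-off the lower bound captures.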
On solving trust-region and other regularised subproblems in optimization
The solution of trust-region and regularisation subproblems which arise in unconstrained optimization is considered. Building on the pioneering work of Gay, Moré and Sorensen, methods which obtain the solution of a sequence of parametrized linear systems by factorization are used. Enhancements using high-order polynomial approximation and inverse iteration ensure that the resulting method is both globally and asymptotically at least superlinearly convergent in all cases, including in the notorious hard case. Numerical experiments validate the effectiveness of our approach. The resulting software is available as packages TRS and RQS as part of the GALAHAD optimization library, and is especially designed for large-scale problems.
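The factorization-based approach the abstract builds on can be sketched in a few lines. This is a hedged NumPy rendition of the classical Gay/Moré-Sorensen iteration (not the TRS/RQS packages themselves, and without their hard-case handling): factorize a sequence of shifted systems (H + lam*I) s = -g and apply Newton's method to the secular equation 1/||s(lam)|| = 1/Delta:

```python
import numpy as np

# Hedged sketch of the basic Gay / More-Sorensen iteration for
#   minimize  g^T s + 0.5 s^T H s   subject to  ||s|| <= Delta.
# Each pass factorizes one shifted system; the hard case is not handled.

def trust_region_subproblem(H, g, Delta, tol=1e-8, max_iter=100):
    n = len(g)
    lam = 0.0
    s = np.zeros(n)
    for _ in range(max_iter):
        try:
            L = np.linalg.cholesky(H + lam * np.eye(n))
        except np.linalg.LinAlgError:
            lam = 2.0 * lam + 1e-3           # crude safeguard until PD
            continue
        s = -np.linalg.solve(L.T, np.linalg.solve(L, g))
        ns = np.linalg.norm(s)
        if lam == 0.0 and ns <= Delta:
            return s                         # interior (Newton) solution
        if abs(ns - Delta) <= tol * Delta:
            return s                         # boundary solution found
        # Newton step on the secular equation 1/||s(lam)|| = 1/Delta,
        # using s^T (H + lam I)^{-1} s = ||L^{-1} s||^2
        w = np.linalg.solve(L, s)
        lam += (ns / np.linalg.norm(w)) ** 2 * (ns - Delta) / Delta
        lam = max(lam, 0.0)
    return s
```

The high-order polynomial approximations mentioned in the abstract replace this first-order Newton correction to cut the number of factorizations further.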
Parallel Selected Inversion for Space-Time Gaussian Markov Random Fields
Performing a Bayesian inference on large spatio-temporal models requires
extracting inverse elements of large sparse precision matrices for marginal
variances. Although direct matrix factorizations can be used for the inversion,
such methods fail to scale well for distributed problems when run on large
computing clusters. In contrast, Krylov subspace methods for selected
inversion have been gaining traction. We propose a parallel hybrid approach
based on domain decomposition, which extends the Rao-Blackwellized Monte Carlo
estimator for distributed precision matrices. Our approach exploits the
strength of Krylov subspace methods as global solvers and efficiency of direct
factorizations as base case solvers to compute the marginal variances using a
divide-and-conquer strategy. By introducing subdomain overlaps, one can achieve
greater accuracy at increased computational effort with little to no
additional communication. We demonstrate the speed improvements on both
simulated models and a massive US daily temperature dataset. Comment: 17 pages, 7 figures
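The sampling idea behind such estimators can be shown in its plainest form. The sketch below is the unconditioned Monte Carlo baseline, a hedged simplification of the Rao-Blackwellized estimator named above: draws from N(0, Q^{-1}) are obtained by a triangular solve against the Cholesky factor of the precision matrix Q, and their sample second moments estimate the marginal variances diag(Q^{-1}):

```python
import numpy as np

# Plain Monte Carlo baseline (a hedged sketch; the Rao-Blackwellized
# estimator refines this per subdomain): estimate diag(Q^{-1}) for a
# precision matrix Q = L L^T by drawing x ~ N(0, Q^{-1}) through the
# solve L^T x = z with z ~ N(0, I).
rng = np.random.default_rng(0)
n = 5
G = rng.standard_normal((n, n))
Q = G @ G.T + n * np.eye(n)          # SPD stand-in precision matrix
L = np.linalg.cholesky(Q)

n_samples = 200_000
Z = rng.standard_normal((n, n_samples))
X = np.linalg.solve(L.T, Z)          # each column is a draw from N(0, Q^{-1})
var_mc = (X * X).mean(axis=1)        # Monte Carlo estimate of diag(Q^{-1})

var_exact = np.diag(np.linalg.inv(Q))
assert np.allclose(var_mc, var_exact, rtol=0.05)
```

Rao-Blackwellization replaces the raw squared samples by conditional expectations on each subdomain, which is what buys the variance reduction, and the domain decomposition distributes those conditional computations.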
Integrated Nested Laplace Approximations for Large-Scale Spatial-Temporal Bayesian Modeling
Bayesian inference tasks continue to pose a computational challenge. This
especially holds for spatial-temporal modeling where high-dimensional latent
parameter spaces are ubiquitous. The methodology of integrated nested Laplace
approximations (INLA) provides a framework for performing Bayesian inference
applicable to a large subclass of additive Bayesian hierarchical models. In
combination with the stochastic partial differential equations (SPDE) approach
it gives rise to an efficient method for spatial-temporal modeling. In this
work we build on the INLA-SPDE approach, by putting forward a performant
distributed memory variant, INLA-DIST, for large-scale applications. To perform
the arising computational kernel operations, consisting of Cholesky
factorizations, solving linear systems, and selected matrix inversions, we
present two numerical solver options, a sparse CPU-based library and a novel
blocked GPU-accelerated approach which we propose. We leverage the recurring
nonzero block structure in the arising precision (inverse covariance) matrices,
which allows us to employ dense subroutines within a sparse setting. Both
versions of INLA-DIST are highly scalable, capable of performing inference on
models with millions of latent parameters. We demonstrate their accuracy and
performance on synthetic as well as real-world climate dataset applications. Comment: 22 pages, 14 figures
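The "dense subroutines within a sparse setting" idea can be illustrated on the simplest recurring structure. The sketch below is a hedged stand-in, not INLA-DIST's GPU kernel: a Cholesky factorization of a symmetric block-tridiagonal precision matrix, the pattern typical of spatial-temporal models with ns spatial points coupled across nt time steps, using dense operations on the repeating ns x ns blocks:

```python
import numpy as np

# Hedged sketch (not INLA-DIST's kernel): Cholesky factorization of a
# symmetric block-tridiagonal Q via dense per-block subroutines, exploiting
# the recurring nonzero block structure instead of a generic sparse solver.

def block_tridiag_cholesky(D, E):
    """D: diagonal blocks Q[t,t]; E: subdiagonal blocks Q[t+1,t].
    Returns (L_diag, L_sub) with Q = L L^T and L block bidiagonal."""
    L_diag, L_sub = [], []
    S = D[0]
    for t in range(len(D)):
        Lt = np.linalg.cholesky(S)             # dense factorization per block
        L_diag.append(Lt)
        if t < len(D) - 1:
            Et = np.linalg.solve(Lt, E[t].T).T # L[t+1,t] = E[t] Lt^{-T}
            L_sub.append(Et)
            S = D[t + 1] - Et @ Et.T           # Schur complement update
    return L_diag, L_sub

rng = np.random.default_rng(0)
ns, nt = 3, 4                                  # block size, number of blocks
E = [0.1 * rng.standard_normal((ns, ns)) for _ in range(nt - 1)]
D = []
for t in range(nt):
    G = rng.standard_normal((ns, ns))
    D.append(G @ G.T + ns * np.eye(ns))        # SPD diagonal blocks
L_diag, L_sub = block_tridiag_cholesky(D, E)
```

Because every block has the same shape, each step maps onto one dense factorization, one triangular solve, and one symmetric rank update, which is what makes the GPU-accelerated blocked variant described above effective.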
Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems
High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery. More recently, the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard's scaling, when Moore's Law seemingly still holds true to a lesser extent, it is no coincidence that HPC systems are equipped with multi-core CPUs and a variety of hardware accelerators that are all massively parallel. Coupled with interconnect networks' speed improvements lagging behind those of computational power, the current state of HPC systems is heterogeneous and extremely complex.
This was heralded as a great challenge to software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming model level, exploring different approaches and proposing new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of some widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format.
Next, I proposed a number of optimizations and implemented them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls to perform data broadcast operations, to further extend the limits of the Sequential Task Flow (STF) model. On the other hand, the Parameterized Task Graph (PTG) approach in PaRSEC is the most scalable approach for many different applications; I then explored the possibility of combining the algorithmic benefits of Communication-Avoiding (CA) methods with the communication-computation overlapping provided by runtime systems, using a 2D five-point stencil as the test case. This broad evaluation and extension of programming models highlighted the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarized the profiling capability of the PaRSEC runtime system and demonstrated with a use case its important role in identifying performance bottlenecks that lead to optimizations.
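For readers unfamiliar with the test case, the computation itself is tiny; the research interest lies in how its halo exchanges are scheduled. A minimal NumPy illustration of the 2D five-point stencil follows (the dissertation's versions run as PaRSEC tasks with overlapped halo exchanges, which this sequential sketch does not attempt to show):

```python
import numpy as np

# Plain sequential illustration of the 2D five-point stencil test case:
# one Jacobi-style sweep averages each interior point with its four
# neighbors; boundary values are held fixed.

def five_point_step(u):
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1]
                            + u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((8, 8))
u[0, :] = 1.0                    # fixed boundary values on one edge
for _ in range(100):
    u = five_point_step(u)       # interior relaxes toward the boundary data
```

In a distributed setting each tile of `u` becomes a task whose edges are the halo rows and columns, which is where communication-avoiding reformulations and overlap pay off.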