614 research outputs found

    An Efficient Algorithm For Simulating Fracture Using Large Fuse Networks

    The high computational cost of simulating progressive fracture with large discrete lattice networks stems from the requirement to solve {\it a new large set of linear equations} every time a lattice bond is broken. To address this problem, we propose an algorithm that combines the multiple-rank sparse Cholesky downdating algorithm with the rank-p inverse updating algorithm based on the Sherman-Morrison-Woodbury formula for simulating progressive fracture in disordered quasi-brittle materials using discrete lattice networks. With the present algorithm, the computational complexity of solving the new set of linear equations after breaking a bond reduces to the same order as that of a simple {\it backsolve} (forward elimination and backward substitution) {\it using the already LU-factored matrix}. That is, the computational cost is $O(\mathrm{nnz}(\mathbf{L}))$, where $\mathrm{nnz}(\mathbf{L})$ denotes the number of non-zeros of the Cholesky factor $\mathbf{L}$ of the stiffness matrix $\mathbf{A}$. This algorithm, based on a direct sparse solver, is faster than Fourier-accelerated preconditioned conjugate gradient (PCG) iterative solvers, and it eliminates the {\it critical slowing down} associated with iterative solvers, which is especially severe close to the critical points. Numerical results using random resistor networks substantiate the efficiency of the present algorithm. Comment: 15 pages including 1 figure. On page 11407 of the original paper (J. Phys. A: Math. Gen. 36 (2003) 11403-11412), Eqs. 11 and 12 were misprinted, which went unnoticed during the proofreading stage.
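    As a rough illustration of the updating idea (a minimal dense sketch, not the paper's lattice code, which works with sparse stiffness matrices; the helper name `smw_solve` is ours), the Sherman-Morrison-Woodbury formula lets one solve with a rank-p modified matrix by reusing an existing Cholesky factorization, at the cost of a few extra backsolves:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def smw_solve(cho_A, U, V, b):
    """Solve (A + U V^T) x = b, reusing cho_A = cho_factor(A).

    Sherman-Morrison-Woodbury:
    (A + U V^T)^{-1} = A^{-1} - A^{-1} U (I + V^T A^{-1} U)^{-1} V^T A^{-1}
    """
    p = U.shape[1]
    Ainv_b = cho_solve(cho_A, b)    # backsolve with the existing factor
    Ainv_U = cho_solve(cho_A, U)    # p additional backsolves
    S = np.eye(p) + V.T @ Ainv_U    # small p-by-p capacitance matrix
    return Ainv_b - Ainv_U @ np.linalg.solve(S, V.T @ Ainv_b)

# Usage: a rank-2 modification of an SPD test matrix.
rng = np.random.default_rng(0)
n, p = 200, 2
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)                # SPD stand-in for a stiffness matrix
cho_A = cho_factor(A)                      # factor once
U, V = rng.standard_normal((n, p)), rng.standard_normal((n, p))
b = rng.standard_normal(n)
x = smw_solve(cho_A, U, V, b)
assert np.allclose((A + U @ V.T) @ x, b)   # solved without refactorization
```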

    Minimizing Communication in Linear Algebra

    In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense matrix multiplication using the conventional $O(n^3)$ algorithm, where the input matrices are too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $\Omega(\#\text{arithmetic operations}/\sqrt{M})$, where $M$ is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, and algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or whether we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph-theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. Comment: 27 pages, 2 tables.
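    As a worked instance of the stated bound (our substitution, not text from the paper): the classical algorithm performs $\Theta(n^3)$ arithmetic operations, so the general lower bound specializes to

```latex
% Bandwidth lower bound: words moved between fast and slow memory
W = \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right)
% Latency lower bound: each message carries at most M words, hence
S = \Omega\!\left(\frac{W}{M}\right)
  = \Omega\!\left(\frac{n^{3}}{M^{3/2}}\right)
```

    A blocked (tiled) algorithm with tile size $\Theta(\sqrt{M})$ attains the bandwidth bound, which is the pattern the communication-optimal LU and QR algorithms mentioned in the abstract follow.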

    On solving trust-region and other regularised subproblems in optimization

    The solution of trust-region and regularisation subproblems which arise in unconstrained optimization is considered. Building on the pioneering work of Gay, Moré and Sorensen, methods which obtain the solution of a sequence of parametrized linear systems by factorization are used. Enhancements using high-order polynomial approximation and inverse iteration ensure that the resulting method is both globally and asymptotically at least superlinearly convergent in all cases, including in the notorious hard case. Numerical experiments validate the effectiveness of our approach. The resulting software is available as packages TRS and RQS as part of the GALAHAD optimization library, and is especially designed for large-scale problems.
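    For context, here is a minimal sketch of the classical factorization-based iteration the abstract builds on (the easy case of the Gay/Moré-Sorensen scheme; the TRS and RQS packages add high-order polynomial approximation, inverse iteration, and hard-case handling, none of which is reproduced here):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def trust_region_subproblem(H, g, Delta, tol=1e-10, max_iter=50):
    """Minimize g^T s + 0.5 s^T H s subject to ||s|| <= Delta (easy case only)."""
    n = len(g)
    # Shift so that H + lam*I is safely positive definite.
    lam = max(0.0, -np.linalg.eigvalsh(H)[0] + 1e-8)
    for _ in range(max_iter):
        c = cho_factor(H + lam * np.eye(n))   # factorize the parametrized system
        s = cho_solve(c, -g)                  # s(lam) = -(H + lam I)^{-1} g
        ns = np.linalg.norm(s)
        if lam == 0.0 and ns <= Delta:
            return s, lam                     # interior solution
        if abs(ns - Delta) <= tol * Delta:
            return s, lam                     # boundary solution found
        # Newton step on the secular equation phi(lam) = 1/||s|| - 1/Delta,
        # using d||s||/dlam = -s^T (H + lam I)^{-1} s / ||s||.
        w = cho_solve(c, s)
        dns = -(s @ w) / ns
        phi = 1.0 / ns - 1.0 / Delta
        dphi = -dns / ns ** 2
        lam = max(lam - phi / dphi, 0.0)
    return s, lam

# Usage on a small indefinite Hessian.
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 5)); H = (H + H.T) / 2
g = rng.standard_normal(5)
s, lam = trust_region_subproblem(H, g, Delta=0.5)
print(np.linalg.norm(s), lam)   # ||s|| ~= 0.5, i.e. on the boundary
```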

    Parallel Selected Inversion for Space-Time Gaussian Markov Random Fields

    Performing Bayesian inference on large spatio-temporal models requires extracting inverse elements of large sparse precision matrices for the marginal variances. Although direct matrix factorizations can be used for the inversion, such methods fail to scale well for distributed problems run on large computing clusters. In contrast, Krylov subspace methods for selected inversion have been gaining traction. We propose a parallel hybrid approach based on domain decomposition, which extends the Rao-Blackwellized Monte Carlo estimator to distributed precision matrices. Our approach exploits the strength of Krylov subspace methods as global solvers and the efficiency of direct factorizations as base-case solvers to compute the marginal variances using a divide-and-conquer strategy. By introducing subdomain overlaps, one can achieve greater accuracy at increased computational effort with little to no additional communication. We demonstrate the speed improvements on both simulated models and a massive US daily temperature dataset. Comment: 17 pages, 7 figures.
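    To make the estimator's starting point concrete, here is a minimal serial, dense sketch of plain Monte Carlo estimation of the marginal variances $\mathrm{diag}(Q^{-1})$ of a precision matrix $Q$ (the paper's Rao-Blackwellization, domain decomposition, and Krylov global solves are not reproduced; the function name is ours):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def mc_marginal_variances(Q, n_samples=1000, seed=0):
    """Estimate diag(Q^{-1}) by sampling x ~ N(0, Q^{-1})."""
    rng = np.random.default_rng(seed)
    L = cholesky(Q, lower=True)                  # Q = L L^T
    Z = rng.standard_normal((Q.shape[0], n_samples))
    # x = L^{-T} z has covariance L^{-T} L^{-1} = Q^{-1}
    X = solve_triangular(L.T, Z, lower=False)
    return (X ** 2).mean(axis=1)

# Usage: compare against the exact inverse on a small SPD precision matrix.
rng = np.random.default_rng(1)
G = rng.standard_normal((50, 50))
Q = G @ G.T + 50 * np.eye(50)
est = mc_marginal_variances(Q, n_samples=20000)
exact = np.diag(np.linalg.inv(Q))
print(np.max(np.abs(est - exact) / exact))       # small relative error
```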

    Integrated Nested Laplace Approximations for Large-Scale Spatial-Temporal Bayesian Modeling

    Bayesian inference tasks continue to pose a computational challenge. This especially holds for spatial-temporal modeling, where high-dimensional latent parameter spaces are ubiquitous. The methodology of integrated nested Laplace approximations (INLA) provides a framework for performing Bayesian inference applicable to a large subclass of additive Bayesian hierarchical models. In combination with the stochastic partial differential equations (SPDE) approach, it gives rise to an efficient method for spatial-temporal modeling. In this work we build on the INLA-SPDE approach by putting forward a performant distributed-memory variant, INLA-DIST, for large-scale applications. To perform the arising computational kernel operations, consisting of Cholesky factorizations, solving linear systems, and selected matrix inversions, we present two numerical solver options: a sparse CPU-based library and a novel blocked GPU-accelerated approach. We leverage the recurring nonzero block structure in the arising precision (inverse covariance) matrices, which allows us to employ dense subroutines within a sparse setting. Both versions of INLA-DIST are highly scalable, capable of performing inference on models with millions of latent parameters. We demonstrate their accuracy and performance on synthetic as well as real-world climate dataset applications. Comment: 22 pages, 14 figures.
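    As a toy illustration of "dense subroutines within a sparse setting" (our sketch, assuming a block-tridiagonal precision matrix, a common pattern in spatial-temporal models; it is not the INLA-DIST code), a block Cholesky factorization can be driven entirely by dense per-block kernels:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def block_tridiag_cholesky(D, B):
    """Block Cholesky Q = L L^T of a block-tridiagonal SPD matrix.

    D[i] are the diagonal blocks, B[i] the sub-diagonal blocks Q[i+1, i].
    Returns the diagonal blocks Ld and sub-diagonal blocks Lb of L.
    """
    nb = len(D)
    Ld, Lb = [None] * nb, [None] * (nb - 1)
    S = D[0]
    for i in range(nb):
        Ld[i] = cholesky(S, lower=True)            # dense Cholesky per block (POTRF)
        if i < nb - 1:
            # L[i+1, i] = B[i] @ Ld[i]^{-T}: a dense triangular solve (TRSM)
            Lb[i] = solve_triangular(Ld[i], B[i].T, lower=True).T
            S = D[i + 1] - Lb[i] @ Lb[i].T         # Schur complement update (SYRK)
    return Ld, Lb

# Usage: factor a 3-block SPD block-tridiagonal matrix.
rng = np.random.default_rng(0)
b, nb = 4, 3
D = [np.eye(b) * 20 for _ in range(nb)]            # diagonally dominant blocks
B = [rng.standard_normal((b, b)) for _ in range(nb - 1)]
Ld, Lb = block_tridiag_cholesky(D, B)
```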

    Evaluation of Distributed Programming Models and Extensions to Task-based Runtime Systems

    High Performance Computing (HPC) has always been a key foundation for scientific simulation and discovery. More recently, the training of deep learning models has further accelerated the demand for computational power and lower-precision arithmetic. In this era following the end of Dennard scaling, when Moore's Law seemingly still holds, if to a lesser extent, it is no coincidence that HPC systems are equipped with multi-core CPUs and a variety of hardware accelerators that are all massively parallel. Coupled with interconnect speeds improving more slowly than computational power, this leaves the current state of HPC systems heterogeneous and extremely complex. This has been heralded as a great challenge to software stacks and their ability to extract performance from these systems, but also as a great opportunity to innovate at the programming-model level by exploring different approaches and proposing new solutions. With usability, portability, and performance as the main factors to consider, this dissertation first evaluates the ability of several widely used parallel programming models (MPI, MPI+OpenMP, and task-based runtime systems) to manage the load imbalance among the processes computing the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format. Next, I proposed a number of optimizations and implemented them in PaRSEC's Dynamic Task Discovery (DTD) model, including user-level graph trimming and direct Application Programming Interface (API) calls to perform the data broadcast operation, to further extend the limits of the STF model. The Parameterized Task Graph (PTG) approach in PaRSEC, on the other hand, is the most scalable approach for many different applications; using it, I explored the possibility of combining the algorithmic benefits of Communication-Avoiding (CA) methods with the communication-computation overlap provided by runtime systems, with a 2D five-point stencil as the test case. This broad evaluation and extension of programming models highlighted the ability of task-based runtime systems to achieve scalable performance and portability on contemporary heterogeneous HPC systems. Finally, I summarized the profiling capability of the PaRSEC runtime system and demonstrated with a use case its important role in identifying the performance bottlenecks that lead to optimizations.
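    As a toy illustration of the load-imbalance issue in the BLR setting (a generic sketch, not PaRSEC's API: in a Block Low-Rank matrix the cost of a tile operation scales with the tile's rank, so a dynamic task pool absorbs uneven work that a static assignment would not):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def lr_tile_matvec(U, V, x):
    """y = (U V^T) x for a low-rank tile; cost scales with rank r = U.shape[1]."""
    return U @ (V.T @ x)

rng = np.random.default_rng(0)
b = 512
ranks = rng.integers(1, 64, size=64)              # uneven per-tile ranks
tiles = [(rng.standard_normal((b, r)), rng.standard_normal((b, r)))
         for r in ranks]
x = rng.standard_normal(b)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Idle workers pull the next tile task, dynamically absorbing the
    # rank-induced imbalance a static round-robin schedule would suffer.
    ys = list(pool.map(lambda t: lr_tile_matvec(*t, x), tiles))
```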