
    On the parallel solution of parabolic equations

    Parallel algorithms for the solution of linear parabolic problems are proposed. The first of these methods is based on a polynomial approximation to the exponential; it does not require solving any linear systems and is highly parallelizable. The other two methods are based on Padé and Chebyshev approximations to the matrix exponential. These methods are parallelized by using partial fraction decomposition techniques to solve the resulting systems, and thus offer the potential for increased time parallelism in time-dependent problems. Experimental results from the Alliant FX/8 and the Cray Y-MP/832 vector multiprocessors are also presented.
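    To make the two ingredients concrete, here is a minimal NumPy sketch (an illustration, not the paper's code). The first routine applies a truncated Taylor polynomial to exp(τA)v, standing in for the paper's polynomial approximation: only matrix-vector products are needed, so no linear systems are solved. The second applies a generic rational approximation written in partial-fraction form; every shifted solve is independent and could run concurrently, which is the source of the extra time parallelism. The poles and weights arguments are placeholders, whereas the paper derives them from Padé or Chebyshev approximations to the exponential.

        import numpy as np

        def expmv_poly(A, v, tau, m=20):
            """Approximate exp(tau*A) @ v with a degree-m Taylor polynomial.

            Uses matrix-vector products only (no linear solves), mirroring the
            first class of methods described in the abstract.
            """
            w = v.astype(float).copy()
            term = v.astype(float).copy()
            for k in range(1, m + 1):
                term = (tau / k) * (A @ term)
                w = w + term
            return w

        def expmv_partial_fractions(A, v, tau, poles, weights, alpha0=0.0):
            """Apply r(tau*A) @ v for a rational approximation written as
            r(z) = alpha0 + sum_j weights[j] / (z - poles[j]).

            Each shifted system below is independent of the others, so the
            loop could be executed concurrently.  The poles/weights are
            placeholders; in the paper they come from Pade or Chebyshev
            approximations to the exponential.
            """
            n = len(v)
            w = alpha0 * v.astype(complex)
            for theta, omega in zip(poles, weights):
                # independent shifted solve -- one per pole, parallelizable
                w += omega * np.linalg.solve(tau * A - theta * np.eye(n), v)
            return w.real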

    Parallelizable sparse inverse formulation Gaussian processes (SpInGP)

    We propose a parallelizable sparse inverse formulation Gaussian process (SpInGP) for temporal models. It uses a sparse precision GP formulation and sparse matrix routines to speed up the computations. Due to the state-space formulation used in the algorithm, the time complexity of the basic SpInGP is linear, and because all the computations are parallelizable, the parallel form of the algorithm is sublinear in the number of data points. We provide example algorithms to implement the sparse matrix routines and experimentally test the method using both simulated and real data. Comment: Presented at Machine Learning in Signal Processing (MLSP 2017).
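    As a rough illustration of the computational pattern the abstract describes (assemble a sparse precision, then rely on sparse solves that scale linearly with the number of data points), the toy sketch below smooths a noisy signal through the precision of a first-order random-walk prior. Everything named here (rw1_posterior_mean, the q and s2 parameters) is hypothetical; the actual SpInGP builds its block-tridiagonal precision from a state-space representation of the GP covariance.

        import numpy as np
        import scipy.sparse as sp
        import scipy.sparse.linalg as spla

        def rw1_posterior_mean(y, q, s2):
            """Posterior mean of a first-order random-walk smoother under
            i.i.d. Gaussian observation noise, computed through the sparse
            precision of the latent process.

            Toy stand-in for SpInGP: the precision is tridiagonal, so the
            single sparse solve below costs O(n) for n data points.
            """
            n = len(y)
            main = np.full(n, 2.0)
            main[[0, -1]] = 1.0                       # RW1 structure matrix
            K = sp.diags([-np.ones(n - 1), main, -np.ones(n - 1)],
                         [-1, 0, 1], format="csc")
            Q_post = K / q + sp.identity(n, format="csc") / s2
            return spla.spsolve(Q_post, y / s2)       # one banded solve

        t = np.linspace(0.0, 10.0, 2000)
        y = np.sin(t) + 0.3 * np.random.default_rng(1).standard_normal(t.size)
        m = rw1_posterior_mean(y, q=1e-3, s2=0.09)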

    A Fast Solver for Large Tridiagonal Systems on Multi-Core Processors (Lass Library)

    Many problems of industrial and scientific interest require solving tridiagonal linear systems. This paper presents several implementations for solving large tridiagonal systems in parallel on multi-core architectures, using the OmpSs programming model. The parallelization strategy is based on combining two existing algorithms, PCR and Thomas. The Thomas algorithm, which cannot be parallelized, requires the fewest floating-point operations. The PCR algorithm is the most popular parallel method, but it is more computationally expensive than Thomas. The method proposed in this paper starts by applying the PCR algorithm to break one large tridiagonal system down into a set of smaller, independent ones. In a second step, these independent systems are solved concurrently using Thomas (a sketch of this building block follows below). The paper also contains an analytical study of the best point at which to switch from PCR to Thomas, and addresses the main performance issues of combining PCR and Thomas by proposing a set of alternative implementations, some of which involve algorithmic changes. The performance evaluation shows that the best implementation achieves a peak speedup of 4 with respect to the Intel MKL counterpart routine and 2.5 with respect to a single-threaded Thomas.
    This work was supported in part by the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreements Human Brain Project SGA1 and Human Brain Project SGA2 under Grant 720270 and Grant 785907; in part by the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII under Grant TIN2015-65316-P; in part by the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya under the project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels under Grant 2014-SGR-1051; in part by a Juan de la Cierva fellowship under Grant IJCI-2017-33511; in part by Fujitsu under the Barcelona Supercomputing Center-Fujitsu Joint Project: Math Libraries Migration and Optimization; in part by the Ministerio de Economía, Industria y Competitividad of Spain and the Fondo Europeo de Desarrollo Regional funds of the European Union under Grant TIN2016-75845-P; in part by the Xunta de Galicia, co-funded by the European Regional Development Fund (ERDF), under the Consolidation Programme of Competitive Reference Groups under Grant ED431C 2017/04; and in part by the Centro Singular de Investigación de Galicia accreditation 2016-2019 under Grant ED431G/01.
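    For reference, a minimal single-threaded sketch of the Thomas recurrence, the building block to which each PCR-produced subsystem is handed, is shown below. This is plain NumPy rather than the Lass/OmpSs implementation, and it makes the sequential data dependence (each forward step needs the previous one) visible.

        import numpy as np

        def thomas(a, b, c, d):
            """Solve a tridiagonal system with the Thomas algorithm.

            a: sub-diagonal (length n-1), b: diagonal (length n),
            c: super-diagonal (length n-1), d: right-hand side (length n).
            The forward sweep is inherently sequential, which is why the
            paper uses PCR first to create independent subsystems.
            """
            n = len(b)
            cp = np.empty(n - 1)
            dp = np.empty(n)
            cp[0] = c[0] / b[0]
            dp[0] = d[0] / b[0]
            for i in range(1, n):                     # forward elimination
                denom = b[i] - a[i - 1] * cp[i - 1]
                if i < n - 1:
                    cp[i] = c[i] / denom
                dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / denom
            x = np.empty(n)
            x[-1] = dp[-1]
            for i in range(n - 2, -1, -1):            # back substitution
                x[i] = dp[i] - cp[i] * x[i + 1]
            return x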

    A parallel hybrid implementation of the 2D acoustic wave equation

    In this paper, we propose a hybrid parallel programming approach to the numerical solution of the two-dimensional acoustic wave equation on a single computer, using an implicit difference scheme. First, we transform the differential equation into an implicit finite-difference equation and then, using the ADI method, split it into two sub-equations. Using the cyclic reduction algorithm, we compute an approximate solution. Finally, we adapt this algorithm to run in parallel on GPU, GPU+OpenMP, and hybrid (GPU+OpenMP+MPI) computing platforms. Special focus is placed on improving the performance of the parallel algorithms, measuring speedup from the execution times. We show that the hybrid code gives the expected results by comparing our results to those obtained by running the same simulation on a single classical processor core and with the CUDA and CUDA+OpenMP implementations. Comment: 10 pages; 1 chart; 1 table; 1 listing; 1 algorithm.
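    The splitting can be sketched with a short, hypothetical NumPy example: one Peaceman-Rachford ADI step for the 2D heat equation, used here only as a stand-in for the implicit wave-equation scheme in the paper, which is split the same way. Each half-step reduces to a family of independent tridiagonal solves (one per grid column, then one per grid row), and these independent solves are exactly what gets handed to cyclic reduction on the GPU.

        import numpy as np
        from scipy.linalg import solve_banded

        def adi_step(u, alpha, dt, h):
            """One Peaceman-Rachford ADI step for u_t = alpha*(u_xx + u_yy)
            on an n x n interior grid with zero Dirichlet boundaries.

            Each half-step is implicit in one direction only, so it decouples
            into n independent tridiagonal solves.
            """
            n = u.shape[0]
            r = alpha * dt / (2.0 * h * h)
            # banded form of (I - r*D2) for scipy.linalg.solve_banded
            ab = np.zeros((3, n))
            ab[0, 1:] = -r
            ab[1, :] = 1.0 + 2.0 * r
            ab[2, :-1] = -r
            # dense 1D second-difference operator, used for the explicit part
            D2 = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
                  + np.diag(np.ones(n - 1), -1))
            # half-step 1: implicit in x, explicit in y (one solve per column)
            rhs = u + r * (u @ D2)
            u_half = np.column_stack(
                [solve_banded((1, 1), ab, rhs[:, j]) for j in range(n)])
            # half-step 2: implicit in y, explicit in x (one solve per row)
            rhs = u_half + r * (D2 @ u_half)
            return np.vstack(
                [solve_banded((1, 1), ab, rhs[i, :]) for i in range(n)])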

    BCYCLIC: A parallel block tridiagonal matrix cyclic solver

    A block tridiagonal matrix is factored with minimal fill-in using a cyclic reduction algorithm that is easily parallelized. Storage of the factored blocks allows the application of the inverse to multiple right-hand sides which may not be known at factorization time. Scalability with the number of block rows is achieved with cyclic reduction, while scalability with the block size is achieved using multithreaded routines (OpenMP, GotoBLAS) for block matrix manipulation. This dual scalability is a noteworthy feature of this new solver, as is its ability to efficiently handle arbitrary (non-power-of-2) numbers of block rows and processors. A comparison with a state-of-the-art parallel sparse solver is presented. It is expected that this new solver will allow many physical applications to make optimal use of the parallel resources on current supercomputers. Example usage of the solver in magnetohydrodynamic (MHD) three-dimensional equilibrium solvers for high-temperature fusion plasmas is cited.
    This research was sponsored by the US Department of Energy under Contract DE-AC05-00OR22725 with UT-Battelle, LLC, and used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.
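    The recursion at the heart of the solver can be illustrated with a scalar sketch (hypothetical, not taken from BCYCLIC, which applies the same idea to dense blocks using multithreaded BLAS). At each level every odd-indexed unknown is eliminated independently of the others; that simultaneous elimination is the step that parallelizes.

        import numpy as np

        def cr_solve(a, b, c, d):
            """Solve a tridiagonal system by scalar cyclic reduction.

            Row i reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with
            a[0] = c[-1] = 0.  Each level eliminates all odd-indexed unknowns
            at once (independent updates); BCYCLIC runs the same recursion
            with dense blocks in place of the scalar arithmetic below.
            """
            n = len(b)
            if n == 1:
                return np.array([d[0] / b[0]])
            keep = list(range(0, n, 2))               # even rows survive
            ra, rb, rc, rd = [], [], [], []
            for i in keep:                            # independent eliminations
                alpha = a[i] / b[i - 1] if i - 1 >= 0 else 0.0
                gamma = c[i] / b[i + 1] if i + 1 < n else 0.0
                ra.append(-alpha * a[i - 1] if i - 1 >= 0 else 0.0)
                rc.append(-gamma * c[i + 1] if i + 1 < n else 0.0)
                rb.append(b[i] - (alpha * c[i - 1] if i - 1 >= 0 else 0.0)
                               - (gamma * a[i + 1] if i + 1 < n else 0.0))
                rd.append(d[i] - (alpha * d[i - 1] if i - 1 >= 0 else 0.0)
                               - (gamma * d[i + 1] if i + 1 < n else 0.0))
            x = np.empty(n)
            x[keep] = cr_solve(np.array(ra), np.array(rb),
                               np.array(rc), np.array(rd))
            for i in range(1, n, 2):                  # back-substitute odd rows
                right = c[i] * x[i + 1] if i + 1 < n else 0.0
                x[i] = (d[i] - a[i] * x[i - 1] - right) / b[i]
            return x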

    A new approximate matrix factorization for implicit time integration in air pollution modeling

    Implicit time stepping typically requires the solution of one or several linear systems with a matrix I−τJ per time step, where J is the Jacobian matrix. If solving these systems is expensive, replacing I−τJ with its approximate matrix factorization (AMF) (I−τR)(I−τV), with R+V=J, often leads to a good compromise between stability and accuracy of the time integration on the one hand and its efficiency on the other. For example, in air pollution modeling, AMF has been successfully used in the framework of Rosenbrock schemes. The standard AMF approximates I−τJ with an error τ²RV, which can be significant in norm. In this paper we propose a new AMF. Under the assumption that −V is an M-matrix, the error of the new AMF can be shown to have the upper bound τ||R||, while still being asymptotically O(τ²). This new AMF, called AMF+, is equal in cost to the standard AMF and, as both analysis and numerical experiments reveal, provides better accuracy. We also report on our experience with another, cheaper AMF and with AMF-preconditioned GMRES.
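    The algebra behind AMF can be shown in a few lines of NumPy (dense random matrices for illustration only; in air pollution models R and V come from an operator splitting such as transport versus chemistry). One AMF step replaces the coupled solve with I−τJ by two cheaper factor solves, at the cost of a splitting error of order τ².

        import numpy as np

        def amf_solve(R, V, tau, b):
            """Approximate solve of (I - tau*J) x = b with J = R + V,
            using the factorization (I - tau*R)(I - tau*V).

            Expanding the product gives I - tau*J + tau^2*R@V, so the
            splitting error of this standard AMF is tau^2 * R @ V.
            """
            n = len(b)
            I = np.eye(n)
            y = np.linalg.solve(I - tau * R, b)       # first (cheap) factor
            return np.linalg.solve(I - tau * V, y)    # second (cheap) factor

        # tiny check of the O(tau^2) consistency of the splitting
        rng = np.random.default_rng(0)
        n, tau = 5, 1e-3
        R = rng.standard_normal((n, n))
        V = rng.standard_normal((n, n))
        b = rng.standard_normal(n)
        exact = np.linalg.solve(np.eye(n) - tau * (R + V), b)
        print(np.linalg.norm(amf_solve(R, V, tau, b) - exact))  # O(tau^2)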

    Some fast elliptic solvers on parallel architectures and their complexities

    The discretization of separable elliptic partial differential equations leads to linear systems with special block tridiagonal matrices. Several methods are known for solving these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm, which handles equations with nonconstant coefficients. A method was recently proposed to parallelize and vectorize BCR. Here, the mapping of BCR onto distributed memory architectures is discussed, and its complexity is compared with that of other approaches, including the Alternating Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has lower parallel computational complexity than parallel BCR.
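    The "explicit formula" family of fast separable solvers mentioned at the end of the abstract can be sketched briefly: for the 5-point Poisson problem with zero Dirichlet boundaries, a 2D sine transform diagonalizes both coordinate directions, so the whole solve is one transform, one elementwise division, and one inverse transform, with every transformed coefficient updated independently. The sketch below uses SciPy's DST and is an illustration under those assumptions, not the solver analyzed in the paper.

        import numpy as np
        from scipy.fft import dstn, idstn

        def fast_poisson(f, h):
            """Solve the 5-point discretization of -laplace(u) = f on a square
            with zero Dirichlet boundaries and mesh width h.

            f holds the right-hand side at the n x n interior points.  The
            sine transform diagonalizes the 1D second-difference operator, so
            each transformed coefficient is obtained by a single division.
            """
            n = f.shape[0]
            k = np.arange(1, n + 1)
            lam = 2.0 - 2.0 * np.cos(k * np.pi / (n + 1))   # 1D eigenvalues
            denom = (lam[:, None] + lam[None, :]) / (h * h)
            fhat = dstn(f, type=1)                          # forward transform
            return idstn(fhat / denom, type=1)              # inverse transform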