On the parallel solution of parabolic equations
Parallel algorithms for the solution of linear parabolic problems are proposed. The first of these methods is based on a polynomial approximation to the exponential. It does not require solving any linear systems and is highly parallelizable. The two other methods proposed are based on Padé and Chebyshev approximations to the matrix exponential. Parallelization of these methods is achieved by using partial fraction decomposition techniques to solve the resulting systems, which offers the potential for increased time parallelism in time-dependent problems. Experimental results from the Alliant FX/8 and the Cray Y-MP/832 vector multiprocessors are also presented.
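To illustrate the first idea in the abstract above (approximating the action of the exponential with a polynomial, so that only matrix-vector products are needed), here is a minimal sketch. It uses a truncated Taylor polynomial, which is an assumption made for simplicity; the paper may well use a different polynomial. The names `matvec` and `expm_apply_taylor` and the small test matrix are hypothetical.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product. Each row is independent of the others,
    so the rows could be computed in parallel."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def expm_apply_taylor(A, v, t, degree=20):
    """Approximate exp(t*A) applied to v with a truncated Taylor polynomial.

    Only matrix-vector products are used; no linear system is solved.
    Taylor truncation is an illustrative choice, not the paper's method.
    """
    result = list(v)
    term = list(v)
    for k in range(1, degree + 1):
        # term_k = (t / k) * A * term_{k-1}
        term = [t * x / k for x in matvec(A, term)]
        result = [r + x for r, x in zip(result, term)]
    return result

# Semi-discrete 1D heat equation u' = A u with A = tridiag(1, -2, 1)
# (three interior points, unit mesh width; purely illustrative).
A = [[-2.0, 1.0, 0.0],
     [1.0, -2.0, 1.0],
     [0.0, 1.0, -2.0]]
v = [1.0, 0.0, -1.0]  # eigenvector of A with eigenvalue -2
u = expm_apply_taylor(A, v, t=0.5)
```

Because `v` is an eigenvector of `A` with eigenvalue -2, the exact result is `exp(-1) * v`, which gives a simple check of the approximation. For stiff problems a plain Taylor polynomial needs a high degree or a small step, which is one reason rational (Padé, Chebyshev) approximations are also considered.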
Parallelizable sparse inverse formulation Gaussian processes (SpInGP)
We propose a parallelizable sparse inverse formulation Gaussian process
(SpInGP) for temporal models. It uses a sparse precision GP formulation and
sparse matrix routines to speed up the computations. Due to the state-space
formulation used in the algorithm, the time complexity of the basic SpInGP is
linear, and because all the computations are parallelizable, the parallel form
of the algorithm is sublinear in the number of data points. We provide example
algorithms to implement the sparse matrix routines and experimentally test the
method using both simulated and real data. Comment: Presented at Machine Learning in Signal Processing (MLSP2017)
A Fast Solver for Large Tridiagonal Systems on Multi-Core Processors (Lass Library)
Many problems of industrial and scientific interest require solving tridiagonal linear systems. This paper presents several implementations for the parallel solution of large tridiagonal systems on multi-core architectures, using the OmpSs programming model. The parallelization strategy combines two existing algorithms, PCR and Thomas. The Thomas algorithm, which cannot be parallelized, requires the fewest floating-point operations. The PCR algorithm is the most popular parallel method, but it is more computationally expensive than Thomas. The method proposed in this paper first applies the PCR algorithm to break one large tridiagonal system into a set of smaller, independent ones. In a second step, these independent systems are solved concurrently using Thomas. The paper also contains an analytical study of the best point at which to switch from PCR to Thomas. In addition, the paper addresses the main performance issues of combining PCR and Thomas by proposing a set of alternative implementations, some of which involve algorithmic changes. The performance evaluation shows that the best implementation achieves a peak speedup of 4 with respect to the Intel MKL counterpart routine and 2.5 with respect to a single-threaded Thomas. This work was supported in part by the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the
Specific Grant Agreements Human Brain Project SGA1 and Human Brain Project SGA2 under Grant 720270 and Grant 785907, in part by
the Spanish Ministry of Economy and Competitiveness under the Project Computación de Altas Prestaciones VII under Grant
TIN2015-65316-P, in part by the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project
MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels under Grant 2014-SGR-1051, in part by the Juan de la Cierva under
Grant IJCI-2017-33511, in part by the Fujitsu under the Barcelona Supercomputing Center-Fujitsu Joint Project: Math Libraries Migration
and Optimization, in part by the Ministerio de Economía, Industria y Competitividad of Spain, in part by the Fondo Europeo de Desarrollo
Regional Funds of the European Union under Grant TIN2016-75845-P, and in part by the Xunta de Galicia co-funded by the European
Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups under Grant ED431C
2017/04, and in part by the Centro Singular de Investigación de Galicia accreditation 2016-2019 under Grant ED431G/01. Xunta de Galicia; ED431C 2017/04. Xunta de Galicia; ED431G/01. Generalitat de Catalunya; 2014-SGR-105
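The PCR-then-Thomas strategy described in the abstract above can be sketched as follows. This is a minimal, sequential illustration under several assumptions: a single PCR step (the paper studies where to switch), a diagonally dominant system with at least two unknowns so that no pivoting is needed, and plain Python instead of OmpSs. The function names `thomas`, `pcr_step`, and `pcr_thomas` are hypothetical and not taken from the Lass library.

```python
def thomas(a, b, c, d):
    """Thomas algorithm for a tridiagonal system.
    a: sub-diagonal (a[0] unused), b: diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side."""
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):  # forward elimination (inherently sequential)
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def pcr_step(a, b, c, d):
    """One parallel cyclic reduction step. Afterwards equation i couples
    only x[i-2], x[i], x[i+2]. Assumes a[0] == 0 and c[-1] == 0."""
    n = len(b)
    na, nb, nc, nd = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
    for i in range(n):  # iterations are independent, hence parallelizable
        alpha = -a[i] / b[i - 1] if i > 0 else 0.0
        gamma = -c[i] / b[i + 1] if i < n - 1 else 0.0
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        na[i] = alpha * a[lo]
        nb[i] = b[i] + alpha * c[lo] + gamma * a[hi]
        nc[i] = gamma * c[hi]
        nd[i] = d[i] + alpha * d[lo] + gamma * d[hi]
    return na, nb, nc, nd

def pcr_thomas(a, b, c, d):
    """Split the system with one PCR step, then solve the even-indexed and
    odd-indexed subsystems independently with Thomas (requires n >= 2)."""
    na, nb, nc, nd = pcr_step(a, b, c, d)
    x = [0.0] * len(b)
    for start in (0, 1):  # the two subsystems could be solved concurrently
        idx = range(start, len(b), 2)
        sub = thomas([na[i] for i in idx], [nb[i] for i in idx],
                     [nc[i] for i in idx], [nd[i] for i in idx])
        for i, xi in zip(idx, sub):
            x[i] = xi
    return x

# Illustrative system tridiag(-1, 4, -1) with exact solution [1, 2, 3, 4, 5, 6].
a = [0.0, -1.0, -1.0, -1.0, -1.0, -1.0]
b = [4.0] * 6
c = [-1.0, -1.0, -1.0, -1.0, -1.0, 0.0]
d = [2.0, 4.0, 6.0, 8.0, 10.0, 19.0]
x_ref = thomas(a, b, c, d)
x_hyb = pcr_thomas(a, b, c, d)
```

Each additional PCR step would double the number of independent subsystems at the cost of extra arithmetic, which is the trade-off behind choosing the switch point.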
A parallel hybrid implementation of the 2D acoustic wave equation
In this paper, we propose a hybrid parallel programming approach for the
numerical solution of a two-dimensional acoustic wave equation on a single
computer, using an implicit finite difference scheme. First, we transform the
differential equation into an implicit finite-difference equation; then,
using the ADI method, we split it into two sub-equations. We compute an
approximate solution with the cyclic reduction algorithm. Finally, we
adapt this algorithm to run in parallel on GPU, GPU+OpenMP, and hybrid
(GPU+OpenMP+MPI) computing platforms.
Special focus is placed on improving the performance of the parallel
algorithms, with the speedup computed from the execution times. We show that
the code that runs on the hybrid approach gives the expected results by comparing
our results to those obtained by running the same simulation on a classical
processor core, CUDA, and CUDA+OpenMP implementations. Comment: 10 pages; 1 Chart; 1 Table; 1 Listing; 1 Algorithm
BCYCLIC: A parallel block tridiagonal matrix cyclic solver
13 pages, 6 figures. A block tridiagonal matrix is factored with minimal fill-in using a cyclic reduction algorithm that is easily parallelized. Storage of the factored blocks allows the application of the inverse to multiple right-hand sides which may not be known at factorization time. Scalability with the number of block rows is achieved with cyclic reduction, while scalability with the block size is achieved using multithreaded routines (OpenMP, GotoBLAS) for block matrix manipulation. This dual scalability is a noteworthy feature of this new solver, as is its ability to efficiently handle arbitrary (non-powers-of-2) block row and processor numbers. Comparison with a state-of-the-art parallel sparse solver is presented. It is expected that this new solver will allow many physical applications to optimally use the parallel resources on current supercomputers. Example usage of the solver in magneto-hydrodynamic (MHD), three-dimensional equilibrium solvers for high-temperature fusion plasmas is cited. This research has been sponsored by the US Department of Energy under Contract DE-AC05-00OR22725 with UT-Battelle, LLC. This research used resources of the National Center for Computational Sciences at Oak Ridge National Laboratory, which is supported by the Office of Science of the Department of Energy under Contract DE-AC05-00OR22725.
A new approximate matrix factorization for implicit time integration in air pollution modeling
Implicit time stepping typically requires the solution of one or several linear systems with a matrix I−τJ per time step, where J is the Jacobian matrix. If the solution of these systems is expensive, replacing I−τJ with its approximate matrix factorization (AMF) (I−τR)(I−τV), R+V=J, often leads to a good compromise between stability and accuracy of the time integration on the one hand and its efficiency on the other. For example, in air pollution modeling, AMF has been successfully used in the framework of Rosenbrock schemes. The standard AMF gives an approximation to I−τJ with the error τ²RV, which can be significant in norm. In this paper we propose a new AMF. Under the assumption that −V is an M-matrix, the error of the new AMF can be shown to have an upper bound τ||R||, while still being asymptotically O(τ²). This new AMF, called AMF+, is equal in cost to the standard AMF and, as both analysis and numerical experiments reveal, provides better accuracy. We also report on our experience with another, cheaper AMF and with AMF-preconditioned GMRES.
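The error term of the standard AMF quoted in the abstract above follows from expanding the product, using only the splitting R+V=J stated there:

```latex
\begin{align*}
(I - \tau R)(I - \tau V) &= I - \tau (R + V) + \tau^{2} R V \\
                         &= (I - \tau J) + \tau^{2} R V , \qquad R + V = J .
\end{align*}
```

So the factorized matrix differs from I−τJ by exactly τ²RV, which is small for small τ but can be large in norm when R and V come from stiff terms.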
Some fast elliptic solvers on parallel architectures and their complexities
The discretization of separable elliptic partial differential equations leads to linear systems with special block tridiagonal matrices. Several methods are known to solve these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm, which handles equations with nonconstant coefficients. A method was recently proposed to parallelize and vectorize BCR. Here, the mapping of BCR on distributed memory architectures is discussed, and its complexity is compared with that of other approaches, including the Alternating-Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has parallel computational complexity lower than that of parallel BCR.