23 research outputs found

A domain decomposing parallel sparse linear system solver

Author: Amestoy
Barrett
Benzi
Berry
Chen
Dongarra
Gravvanis
Karypis
Lawrie
Lawson
Li
Manguoglu
Murat Manguoglu
Polizzi
Sameh
Schenk
Publication venue: 'Elsevier BV'
Publication date: 26/08/2011

The solution of large sparse linear systems is often the most time-consuming part of many science and engineering applications. Computational fluid dynamics, circuit simulation, power network analysis, and material science are just a few examples of the application areas in which large sparse linear systems need to be solved effectively. In this paper we introduce a new parallel hybrid sparse linear system solver for distributed memory architectures that contains both direct and iterative components. We show that by using our solver one can alleviate the drawbacks of direct and iterative solvers, achieving better scalability than with direct solvers and more robustness than with classical preconditioned iterative solvers. Comparisons to well-known direct and iterative solvers on a parallel architecture are provided. Comment: To appear in Journal of Computational and Applied Mathematics

arXiv.org e-Print Archive

Crossref

OpenMETU (Middle East Technical University)

Parallel algorithms and architecture for computation of manipulator forward dynamics

Author: Bejczy Antal K.
Fijany Amir

Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n^2), and the O(n^3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class NC and that the time and processor bounds are O((log_2 n)^2) and O(n^4), respectively. However, the fastest stable parallel algorithms achieve a computation time of O(n) and can be derived by parallelizing the O(n^3) serial algorithms. Parallel computation of the O(n^3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, the decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders of magnitude in the computation of the forward dynamics.
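The last two steps named in this abstract (decomposing a symmetric, positive definite matrix and solving the resulting triangular systems) can be illustrated with a small serial sketch. This is not the authors' parallel algorithm or their processor-array mapping; the function names `cholesky` and `solve_spd` and the sample matrix are assumptions made only for illustration.

```python
import math


def cholesky(M):
    """Return lower-triangular L with M = L * L^T (M must be symmetric positive definite)."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(s)   # diagonal entry
            else:
                L[i][j] = s / L[j][j]    # below-diagonal entry
    return L


def solve_spd(M, rhs):
    """Solve M x = rhs using Cholesky plus two triangular solves."""
    L = cholesky(M)
    n = len(rhs)
    # Forward substitution: L y = rhs
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = y
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Hypothetical 3x3 stand-in for an inertia matrix and a right-hand side.
inertia = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
accelerations = solve_spd(inertia, [6.0, 10.0, 8.0])
```

In forward dynamics the matrix would be the joint-space inertia matrix and the solution the joint accelerations; the serial cost of this step is O(n^3), which is the part the abstract proposes to parallelize.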

NASA Technical Reports Server

QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

Author: Camille Coti
Emmanuel Agullo
Jack Dongarra
Julien Langou
Thomas Herault
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 13/12/2009

Previous studies have reported that common dense linear algebra operations do not achieve speedup by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than in other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to combine a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's). Comment: Accepted at IPDPS10 (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA).
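The idea behind a communication-avoiding QR of a tall and skinny matrix can be sketched in a few lines: each row block is factored locally, and only the small R factors are combined. The sketch below is a serial, single-level illustration under stated assumptions. It uses modified Gram-Schmidt instead of the Householder-based kernels a real library would call, it assumes every row block has full column rank, and the names `thin_qr_r` and `tsqr_r` are hypothetical. It does not model QCG-OMPI or the grid topology.

```python
import math


def thin_qr_r(rows):
    """Return the n x n upper-triangular R of a thin QR (modified Gram-Schmidt).

    `rows` is an m x n matrix (list of rows) with m >= n and full column rank.
    """
    m, n = len(rows), len(rows[0])
    cols = [[rows[i][j] for i in range(m)] for j in range(n)]
    q = []                                   # orthonormal columns found so far
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            r[k][j] = sum(qk * vi for qk, vi in zip(q[k], v))
            v = [vi - r[k][j] * qk for qk, vi in zip(q[k], v)]
        r[j][j] = math.sqrt(sum(vi * vi for vi in v))
        q.append([vi / r[j][j] for vi in v])
    return r


def tsqr_r(rows, num_blocks):
    """R factor of a tall-skinny matrix via one flat reduction over row blocks."""
    m = len(rows)
    size = m // num_blocks
    stacked = []
    for b in range(num_blocks):
        lo = b * size
        hi = m if b == num_blocks - 1 else lo + size
        # Local step: each block (think "one site") factors its own rows.
        stacked.extend(thin_qr_r(rows[lo:hi]))
    # Reduction step: only the small stacked R factors are combined.
    return thin_qr_r(stacked)


# Hypothetical 12 x 3 test matrix split into 3 row blocks.
A = [[1.0, float(x), float(x * x)] for x in range(1, 13)]
R = tsqr_r(A, 3)
```

Because the R factor with a positive diagonal is unique for a full-rank matrix, `tsqr_r(A, 3)` should agree with `thin_qr_r(A)` up to rounding, while exchanging only 3 x 3 triangles between blocks instead of full panels.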

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL-Rennes 1

Some fast elliptic solvers on parallel architectures and their complexities

Author: Gallopoulos E.
Saad Youcef

The discretization of separable elliptic partial differential equations leads to linear systems with special block tridiagonal matrices. Several methods are known to solve these systems, the most general of which is the Block Cyclic Reduction (BCR) algorithm, which handles equations with nonconstant coefficients. A method was recently proposed to parallelize and vectorize BCR. Here, the mapping of BCR on distributed memory architectures is discussed, and its complexity is compared with that of other approaches, including the Alternating-Direction method. A fast parallel solver is also described, based on an explicit formula for the solution, which has parallel computational complexity lower than that of parallel BCR.
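To make the reduction idea concrete, here is a minimal scalar analogue of cyclic reduction for a tridiagonal system. It is a sketch, not the block algorithm or the parallel mapping discussed in the abstract: in BCR the scalar divisions below become block solves. The function name `cyclic_reduction` and the conventions (size n = 2^k - 1, `a[0] == 0`, `c[-1] == 0`, no pivoting) are assumptions chosen for illustration.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] by cyclic reduction.

    Assumes len(b) == 2**k - 1, a[0] == 0, c[-1] == 0, and nonzero pivots
    (for example, a diagonally dominant system).
    """
    n = len(b)
    if (n + 1) & n:
        raise ValueError("size must be 2**k - 1")
    if n == 1:
        return [d[0] / b[0]]

    # Reduction: eliminate the even-indexed unknowns from each odd-indexed equation.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        gamma = c[i] / b[i + 1]
        ra.append(-alpha * a[i - 1])
        rb.append(b[i] - alpha * c[i - 1] - gamma * a[i + 1])
        rc.append(-gamma * c[i + 1])
        rd.append(d[i] - alpha * d[i - 1] - gamma * d[i + 1])

    x = [0.0] * n
    # The half-size system has the same tridiagonal shape; solve it recursively.
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitution: every even-indexed unknown follows from its neighbours.
    for i in range(0, n, 2):
        left = a[i] * x[i - 1] if i > 0 else 0.0
        right = c[i] * x[i + 1] if i < n - 1 else 0.0
        x[i] = (d[i] - left - right) / b[i]
    return x
```

All eliminations within one level are independent of each other, as are all back-substitutions, which is what makes the scheme attractive on vector and parallel machines.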

NASA Technical Reports Server

Parallel Two-Stage Least Squares algorithms for Simultaneous Equations Models on GPU

Author: Giménez Domingo
López-Espín José J.
Ramiro Sánchez Carla
Vidal Antonio M.
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 06/03/2012

Today it is common for computational systems to combine a multicore CPU with one or more GPUs. These systems are heterogeneous, due to the different types of memory in the GPUs and to the different speeds of computation of the cores in the CPU and the GPU. To accelerate the solution of complex problems it is necessary to combine the two basic components (CPU and GPU) in the heterogeneous system. This paper analyzes the use of a multicore+multiGPU system for solving Simultaneous Equations Models by the Two-Stage Least Squares method with QR decomposition. The combination of CPU and GPU allows us to reduce the execution time in the solution of large SEM. Ramiro Sánchez, C.; López-Espín, JJ.; Giménez, D.; Vidal, AM. (2012). Parallel Two-Stage Least Squares algorithms for Simultaneous Equations Models on GPU. http://hdl.handle.net/10251/1496

RiuNet

Memory-aware i-vector extraction by means of subspace factorization

Author: Cumani S.
Laface P.
Publication venue: IEEE - INST ELECTRICAL ELECTRONICS ENGINEERS INC

Most state-of-the-art speaker recognition systems use i-vectors, a compact representation of spoken utterances. Since the "standard" i-vector extraction procedure requires large memory structures, we recently presented the Factorized Sub-space Estimation (FSE) approach, an efficient technique that dramatically reduces the memory needs for i-vector extraction, and is also fast and accurate compared to other proposed approaches. FSE is based on the approximation of the matrix T, representing the speaker variability sub-space, by means of the product of appropriately designed matrices. In this work, we introduce and evaluate a further approximation of the matrices that contribute most to the memory costs in the FSE approach, showing that it is possible to obtain comparable system accuracy using less than half of the FSE memory, which corresponds to a more than 60-fold memory reduction with respect to the standard method of i-vector extraction.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards

Author: Li Ang
Negrut Dan
Serban Radu
Publication date: 25/09/2015

We discuss an approach for solving sparse or dense banded linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ on a Graphics Processing Unit (GPU) card. The matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is possibly nonsymmetric and moderately large; i.e., $10000 \leq N \leq 500000$. The split and parallelize (SaP) approach seeks to partition the matrix $\mathbf{A}$ into diagonal sub-blocks $\mathbf{A}_i$, $i = 1, \ldots, P$, which are independently factored in parallel. The solution may choose to consider or to ignore the matrices that couple the diagonal sub-blocks $\mathbf{A}_i$. This approach, along with the Krylov subspace-based iterative method that it preconditions, is implemented in a solver called SaP::GPU, which is compared in terms of efficiency with three commonly used sparse direct solvers: PARDISO, SuperLU, and MUMPS. SaP::GPU, which runs entirely on the GPU except for several stages involved in preliminary row-column permutations, is robust and compares well in terms of efficiency with the aforementioned direct solvers. In a comparison against Intel's MKL, SaP::GPU also fares well when used to solve dense banded systems that are close to being diagonally dominant. SaP::GPU is publicly available and distributed as open source under a permissive BSD3 license. Comment: 38 pages
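The variant that ignores the coupling between diagonal sub-blocks amounts to a block-diagonal preconditioner. The sketch below illustrates that idea only; it is not SaP::GPU code. Two simplifications are swapped in and should be read as assumptions: a plain preconditioned Richardson iteration replaces the Krylov method named in the abstract, and each block is re-solved by dense Gaussian elimination on every iteration instead of being factored once. All function names are hypothetical.

```python
def solve_dense(A, b):
    """Gaussian elimination with partial pivoting on copies of A and b."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def apply_block_diag_inverse(A, r, num_blocks):
    """Solve with the diagonal sub-blocks A_i only, ignoring the coupling entries."""
    n = len(r)
    size = n // num_blocks
    z = []
    for p in range(num_blocks):          # each block solve is independent
        lo = p * size
        hi = n if p == num_blocks - 1 else lo + size
        block = [row[lo:hi] for row in A[lo:hi]]
        z.extend(solve_dense(block, r[lo:hi]))
    return z


def block_diag_preconditioned_solve(A, b, num_blocks, tol=1e-10, max_iter=200):
    """Richardson iteration x <- x + M^{-1} (b - A x), with M = blockdiag(A_1..A_P)."""
    n = len(b)
    x = [0.0] * n
    for it in range(max_iter):
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) < tol:
            return x, it
        z = apply_block_diag_inverse(A, r, num_blocks)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x, max_iter
```

On a diagonally dominant banded matrix the dropped coupling entries are small relative to the blocks, so the iteration converges quickly; that is consistent with the abstract's remark about systems that are close to being diagonally dominant.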

arXiv.org e-Print Archive

CiteSeerX

Enhanced Capabilities of the Spike Algorithm and a New Spike-OpenMP Solver

Author: Spring Braegan S
Publication venue: ScholarWorks@UMass Amherst
Publication date: 07/11/2014

SPIKE is a parallel algorithm for solving block tridiagonal systems. In this work, two useful improvements to the algorithm are proposed. A flexible threading strategy is developed to overcome limitations of the recursive reduced system method. Allocating multiple threads to some tasks created by the SPIKE algorithm removes the previous restriction that recursive SPIKE may only use a number of threads equal to a power of two. Additionally, a method of solving transpose problems is shown. This method matches the performance of the non-transpose solve while reusing the original factorization.

ScholarWorks@UMass Amherst