Search CORE

Load-balanced parallel banded-system solvers

Author: Chung Kuo-Liang
Wu Jung-Gen
Yan Wen-Ming
Publication venue: Elsevier Science B.V.
Publication date: 23/10/2002

Solving banded systems is important in science and engineering applications. This paper presents a load-balancing strategy for solving banded systems in parallel when the number of processors used is small. An optimization-based load-balancing analysis is given to determine how much load should be assigned to each processor in order to minimize the time requirement. Experiments carried out on the nCUBE 2E multiprocessor demonstrate the speedup advantage of the proposed load-balancing strategy: the speedup improvement ratio ranges from 47% to 66% when using 4 processors and from 12% to 24% when using 8 processors.
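The abstract does not state the cost model behind its optimization-based analysis. As a minimal sketch, assume each processor i spends a fixed, hypothetical time `cost_per_row[i]` per assigned row (for instance because interior processors in a partitioned banded solve do extra fill-in work that the first and last processors avoid). Minimizing the largest finishing time for a fixed total number of rows then gives each processor a share proportional to `1 / cost_per_row[i]`. The function name `balance_rows` is illustrative and not taken from the paper.

```python
from typing import List


def balance_rows(n_rows: int, cost_per_row: List[float]) -> List[int]:
    """Split n_rows among processors so estimated finish times are as equal as possible.

    cost_per_row[i] is an ASSUMED time per row on processor i (hypothetical model).
    Minimizing max_i cost_per_row[i] * n_i subject to sum(n_i) = n_rows gives,
    in the continuous relaxation, n_i proportional to 1 / cost_per_row[i].
    """
    speeds = [1.0 / c for c in cost_per_row]
    total = sum(speeds)
    ideal = [n_rows * s / total for s in speeds]   # real-valued optimum
    rows = [int(x) for x in ideal]                 # round down first
    # Hand the leftover rows to the processors with the largest fractional parts.
    remainder = n_rows - sum(rows)
    order = sorted(range(len(ideal)), key=lambda i: ideal[i] - rows[i], reverse=True)
    for i in order[:remainder]:
        rows[i] += 1
    return rows
```

With `cost_per_row = [1.0, 1.5, 1.5, 1.0]` and 1200 rows, the sketch assigns `[360, 240, 240, 360]`, so every processor finishes after 360 time units; with equal costs it reduces to an even split.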

Elsevier - Publisher Connector

Analysis of A Splitting Approach for the Parallel Solution of Linear Systems on GPU Cards

Author: Li Ang
Negrut Dan
Serban Radu
Publication date: 25/09/2015

We discuss an approach for solving sparse or dense banded linear systems $\mathbf{A}\mathbf{x} = \mathbf{b}$ on a Graphics Processing Unit (GPU) card. The matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ is possibly nonsymmetric and moderately large, i.e., $10000 \leq N \leq 500000$. The split and parallelize (SaP) approach seeks to partition the matrix $\mathbf{A}$ into diagonal sub-blocks $\mathbf{A}_i$, $i = 1, \ldots, P$, which are independently factored in parallel. The solution strategy may either account for or ignore the matrices that couple the diagonal sub-blocks $\mathbf{A}_i$. This approach, along with the Krylov subspace-based iterative method that it preconditions, is implemented in a solver called SaP::GPU, which is compared in terms of efficiency with three commonly used sparse direct solvers: PARDISO, SuperLU, and MUMPS. SaP::GPU, which runs entirely on the GPU except for several stages involved in preliminary row-column permutations, is robust and compares well in terms of efficiency with the aforementioned direct solvers. In a comparison against Intel's MKL, SaP::GPU also fares well when used to solve dense banded systems that are close to being diagonally dominant. SaP::GPU is publicly available and distributed as open source under a permissive BSD3 license.

Comment: 38 pages
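The variant that ignores the coupling matrices amounts to a block-Jacobi preconditioner built from the diagonal blocks $\mathbf{A}_i$. The sketch below assumes a tridiagonal (bandwidth-one) matrix stored as three lists, solves each block with the Thomas algorithm (no pivoting, so diagonal dominance is assumed), and, for brevity, wraps the preconditioner in a plain preconditioned Richardson iteration instead of a Krylov method. It is not SaP::GPU's implementation; all function names are hypothetical.

```python
from typing import List


def thomas(a: List[float], d: List[float], c: List[float], rhs: List[float]) -> List[float]:
    """Solve a tridiagonal system without pivoting (assumes diagonal dominance).

    a: sub-diagonal (a[0] ignored), d: diagonal, c: super-diagonal (c[-1] ignored).
    """
    n = len(d)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / d[0], rhs[0] / d[0]
    for i in range(1, n):
        m = d[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        rp[i] = (rhs[i] - a[i] * rp[i - 1]) / m
    x = [0.0] * n
    x[-1] = rp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x


def matvec(a: List[float], d: List[float], c: List[float], x: List[float]) -> List[float]:
    """y = A x for the full tridiagonal matrix, coupling entries included."""
    n = len(d)
    y = [d[i] * x[i] for i in range(n)]
    for i in range(1, n):
        y[i] += a[i] * x[i - 1]
    for i in range(n - 1):
        y[i] += c[i] * x[i + 1]
    return y


def block_jacobi(a, d, c, r, P: int) -> List[float]:
    """Apply M^{-1} r, where M keeps only the P diagonal blocks A_i.

    Slicing makes a[lo] and c[hi-1] the ignored first/last entries inside
    thomas(), so the coupling between neighbouring blocks is dropped.
    Each block solve is independent and could run in parallel.
    """
    n = len(d)
    bounds = [n * k // P for k in range(P + 1)]
    z: List[float] = []
    for lo, hi in zip(bounds, bounds[1:]):
        z.extend(thomas(a[lo:hi], d[lo:hi], c[lo:hi], r[lo:hi]))
    return z


def sap_like_solve(a, d, c, b, P: int = 4, tol: float = 1e-10, max_iter: int = 200) -> List[float]:
    """Preconditioned Richardson iteration x <- x + M^{-1} (b - A x)."""
    x = [0.0] * len(d)
    for _ in range(max_iter):
        r = [bi - yi for bi, yi in zip(b, matvec(a, d, c, x))]
        if max(abs(v) for v in r) < tol:
            break
        z = block_jacobi(a, d, c, r, P)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x
```

Replacing the Richardson loop with a Krylov solver would keep `block_jacobi` unchanged; only the outer iteration differs.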

arXiv.org e-Print Archive

CiteSeerX

Distributed Finite Element Analysis Using a Transputer Network

Author: Baehmann Peggy
Danial Albert
Favenesi James
Reynolds Brian
Shephard Mark
Tombrello Joseph
Turrentine Ronald
Watson James
Yang Dabby

The principal objective of this research effort was to demonstrate the extraordinarily cost-effective acceleration of finite element structural analysis problems using a transputer-based parallel processing network. This objective was accomplished in the form of a commercially viable parallel processing workstation. The workstation is a desktop-size, low-maintenance computing unit capable of supercomputer performance, yet it costs two orders of magnitude less. To achieve the principal research objective, a transputer-based structural analysis workstation termed XPFEM was implemented with linear static structural analysis capabilities resembling those of the commercially available NASTRAN. Finite element model files, generated using the on-line preprocessing module or external preprocessing packages, are downloaded to a network of 32 transputers for accelerated solution. The system currently executes at about one third of Cray X-MP24 speed, but additional acceleration appears likely. For the NASA-selected demonstration problem, a Space Shuttle main engine turbine blade model with about 1500 nodes and 4500 independent degrees of freedom, the Cray X-MP24 required 23.9 seconds to obtain a solution, while the transputer network, operated from an IBM PC-AT compatible host computer, required 71.7 seconds. Consequently, the $80,000 transputer network demonstrated a cost-performance ratio about 60 times better than the $15,000,000 Cray X-MP24 system.
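Reading "cost-performance ratio" as the cost ratio divided by the run-time ratio (an interpretation the report does not spell out), the quoted figures reproduce both the "about one third" speed and the factor of about 60:

```latex
\frac{t_{\text{transputer}}}{t_{\text{Cray}}} = \frac{71.7\ \text{s}}{23.9\ \text{s}} = 3.0,
\qquad
\frac{\text{cost}_{\text{Cray}}}{\text{cost}_{\text{transputer}}} = \frac{15{,}000{,}000}{80{,}000} = 187.5,
\qquad
\frac{187.5}{3.0} = 62.5 \approx 60
```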

NASA Technical Reports Server

Algebraic, Block and Multiplicative Preconditioners based on Fast Tridiagonal Solves on GPUs

Author: Klein Christoph Julian
Publication date: 01/01/2023

This thesis contributes to the fields of sparse linear algebra, graph applications, and preconditioners for Krylov iterative solvers of sparse linear equation systems by providing a (block) tridiagonal solver library, a generalized sparse matrix-vector multiplication implementation, a linear forest extraction, and a multiplicative preconditioner based on tridiagonal solves. The tridiagonal library, which supports (scaled) partial pivoting, outperforms cuSPARSE's tridiagonal solver by a factor of five while fully utilizing the available GPU memory bandwidth. For performance-optimized solving with multiple right-hand sides, the explicit factorization of the tridiagonal matrix can be computed. The extraction of a weighted linear forest (a union of disjoint paths) from a general graph is used to build algebraic (block) tridiagonal preconditioners and relies on the generalized sparse matrix-vector multiplication implementation of this thesis for preconditioner construction. During linear forest extraction, a new parallel bidirectional scan pattern, which can operate on doubly linked list structures, identifies the path ID and the position of each vertex. The algebraic preconditioner construction is also used to build more advanced preconditioners, which contain multiple tridiagonal factors, based on generalized ILU factorizations. Additionally, other preconditioners based on tridiagonal factors are presented and evaluated in comparison to ILU and ILU with incomplete sparse approximate inverse (ILU-ISAI) preconditioners for the solution of large sparse linear equation systems from the Sparse Matrix Collection. For all problems presented in this thesis, an efficient parallel algorithm and its CUDA implementation for single-GPU systems are provided.
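The benefit of an explicit factorization for multiple right-hand sides can be sketched sequentially. The block below is a minimal LU factorization of a tridiagonal matrix without pivoting (so it assumes diagonal dominance, unlike the thesis library, which supports scaled partial pivoting and runs on the GPU); the factors are computed once and reused for every right-hand side. The names `factor_tridiag` and `solve_factored` are hypothetical.

```python
from typing import List, Tuple


def factor_tridiag(a: List[float], d: List[float], c: List[float]) -> Tuple[List[float], List[float]]:
    """LU-factor a tridiagonal matrix without pivoting (assumes diagonal dominance).

    a: sub-diagonal (a[0] unused), d: diagonal, c: super-diagonal (c[-1] unused).
    Returns (l, u): L is unit lower bidiagonal with multipliers l[1:],
    U is upper bidiagonal with diagonal u and the unchanged super-diagonal c.
    """
    n = len(d)
    l, u = [0.0] * n, [0.0] * n
    u[0] = d[0]
    for i in range(1, n):
        l[i] = a[i] / u[i - 1]
        u[i] = d[i] - l[i] * c[i - 1]
    return l, u


def solve_factored(l: List[float], u: List[float], c: List[float], rhs: List[float]) -> List[float]:
    """Solve L U x = rhs using a precomputed factorization; O(n) per right-hand side."""
    n = len(u)
    y = [0.0] * n
    y[0] = rhs[0]
    for i in range(1, n):          # forward substitution with unit lower bidiagonal L
        y[i] = rhs[i] - l[i] * y[i - 1]
    x = [0.0] * n
    x[-1] = y[-1] / u[-1]
    for i in range(n - 2, -1, -1):  # back substitution with upper bidiagonal U
        x[i] = (y[i] - c[i] * x[i + 1]) / u[i]
    return x
```

For example, after `l, u = factor_tridiag([0.0, 1.0, 1.0], [4.0, 4.0, 4.0], [1.0, 1.0, 0.0])`, the right-hand sides `[6.0, 12.0, 14.0]` and `[5.0, 6.0, 5.0]` are solved by two calls to `solve_factored` without refactoring.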

Heidelberger Dokumentenserver

Recent Advances in Graph Partitioning

Author: A Buluç
A Felner
A George
A Lisser
A Pothen
A Trifunović
AB Kahng
AE Feldmann
AH Land
AJ Soper
B Brandfass
B Hendrickson
B Junker
B Monien
B Peng
BW Kernighan
C Aykanat
C Chevalier
C Farhat
C Lanczos
C Walshaw
CE Bichot
CE Ferreira
D Delling
D Drake
D Luxen
D Ron
D Wagner
DA Papa
DE Drake Vinkemeier
E Jeannot
E Rolland
F Comellas
F Glover
F Pellegrini
F Schulz
FT Leighton
G Even
G Karypis
G Zumbusch
H Li
H Meyerhenke
HD Simon
I Moulitsas
I Safro
J Chen
J Cong
J Fietz
J Hromkovič
J Hungershöfer
J Maue
J Shalf
JR Gilbert
K Andreev
K Lang
K Schloegel
KS Camilus
L Brunetta
L Grady
L Lovász
LA Sanchis
LR Ford
M Armbruster
M Bader
M Birn
M Fiedler
M Jerrum
M Newman
M Sellmann
M Zhou
MR Garey
N Sensen
O Goldschmidt
P Chardaire
P Galinier
P Korosec
P Sanders
R Diekmann
R Glantz
R Preis
RD Williams
S Arora
S Huang
S Lafon
S Lloyd
S Pettie
SE Karisch
SY Chan
T Bui
T Kieritz
U Benlic
U Feige
V Osipov
WE Donath
WW Hager
X Sui
Y Low
YM Kim
Ü Çatalyürek
Publication date: 03/02/2015

We survey recent trends in practical algorithms for balanced graph partitioning, together with applications and future research directions.
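A common formulation of balanced k-way partitioning asks for a block assignment that minimizes the number of cut edges subject to every block holding at most (1 + ε)⌈|V|/k⌉ vertices. The sketch below, with hypothetical helper names, evaluates both quantities and finds an optimal balanced bisection of a toy graph by exhaustive search. That brute force is only feasible for a handful of vertices because the problem is NP-hard, which is why practical partitioners rely on heuristics such as multilevel schemes.

```python
import itertools
import math
from typing import List, Optional, Tuple

Edge = Tuple[int, int]


def edge_cut(edges: List[Edge], part: List[int]) -> int:
    """Number of edges whose endpoints lie in different blocks."""
    return sum(1 for u, v in edges if part[u] != part[v])


def is_balanced(part: List[int], k: int, eps: float) -> bool:
    """Check that every block holds at most (1 + eps) * ceil(n / k) vertices."""
    limit = (1.0 + eps) * math.ceil(len(part) / k)
    sizes = [0] * k
    for block in part:
        sizes[block] += 1
    return max(sizes) <= limit


def best_bisection(n: int, edges: List[Edge], eps: float = 0.0) -> Optional[Tuple[int, List[int]]]:
    """Exhaustive search over all 2^n bisections; only usable for tiny graphs."""
    best: Optional[Tuple[int, List[int]]] = None
    for assign in itertools.product((0, 1), repeat=n):
        part = list(assign)
        if not is_balanced(part, 2, eps):
            continue
        cut = edge_cut(edges, part)
        if best is None or cut < best[0]:
            best = (cut, part)
    return best


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
```

On this toy graph, `best_bisection(6, EDGES)` returns `(1, [0, 0, 0, 1, 1, 1])`: the optimal balanced cut separates the two triangles, while the unbalanced assignment `[0, 0, 1, 1, 1, 1]` both cuts two edges and violates the balance limit.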

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

Enhanced Capabilities of the Spike Algorithm and a New Spike-OpenMP Solver

Author: Spring Braegan S
Publication venue: ScholarWorks@UMass Amherst
Publication date: 07/11/2014

SPIKE is a parallel algorithm for solving block tridiagonal linear systems. In this work, two useful improvements to the algorithm are proposed. A flexible threading strategy is developed to overcome limitations of the recursive reduced system method. Allocating multiple threads to some tasks created by the SPIKE algorithm removes the previous restriction that recursive SPIKE may only use a number of threads equal to a power of two. Additionally, a method of solving transpose problems is shown. This method matches the performance of the non-transpose solve while reusing the original factorization.
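For orientation, the sketch below shows the basic, non-recursive SPIKE idea in its smallest setting: a tridiagonal system split into two partitions. Each diagonal block is solved independently, the "spikes" V and W come from solving each block against its single coupling entry, a 2-by-2 reduced system links the partitions, and a final independent update recovers the solution. It assumes diagonal dominance (the block solves use the Thomas algorithm without pivoting), and the names `spike2` and `thomas` are hypothetical; the recursive reduced-system method, the threading strategy, and the transpose solve described in the abstract are not shown.

```python
from typing import List


def thomas(a: List[float], d: List[float], c: List[float], rhs: List[float]) -> List[float]:
    """Tridiagonal solve without pivoting; a[0] and c[-1] are ignored."""
    n = len(d)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / d[0], rhs[0] / d[0]
    for i in range(1, n):
        m = d[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        rp[i] = (rhs[i] - a[i] * rp[i - 1]) / m
    x = [0.0] * n
    x[-1] = rp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x


def spike2(a: List[float], d: List[float], c: List[float], f: List[float]) -> List[float]:
    """Two-partition SPIKE for a tridiagonal system (bandwidth 1, n >= 2)."""
    n = len(d)
    m = n // 2                     # partition 1: rows [0, m), partition 2: rows [m, n)
    a1, d1, c1 = a[:m], d[:m], c[:m]
    a2, d2, c2 = a[m:], d[m:], c[m:]
    b, e = c[m - 1], a[m]          # coupling entries A[m-1, m] and A[m, m-1]

    # Step 1: independent block solves D g = f, with D = diag(A1, A2).
    g1 = thomas(a1, d1, c1, f[:m])
    g2 = thomas(a2, d2, c2, f[m:])

    # Step 2: spikes V = A1^{-1} (b * last unit vector), W = A2^{-1} (e * first unit vector).
    rv = [0.0] * m
    rv[-1] = b
    rw = [0.0] * (n - m)
    rw[0] = e
    V = thomas(a1, d1, c1, rv)
    W = thomas(a2, d2, c2, rw)

    # Step 3: reduced system in t = x1[-1] and s = x2[0]:
    #   t + V[-1] * s = g1[-1],   W[0] * t + s = g2[0]
    det = 1.0 - V[-1] * W[0]
    t = (g1[-1] - V[-1] * g2[0]) / det
    s = (g2[0] - W[0] * g1[-1]) / det

    # Step 4: independent retrieval x1 = g1 - V s, x2 = g2 - W t.
    x1 = [g - v * s for g, v in zip(g1, V)]
    x2 = [g - w * t for g, w in zip(g2, W)]
    return x1 + x2
```

Steps 1, 2, and 4 touch only one partition at a time, which is where the parallelism comes from; with more partitions the reduced system grows, and solving it recursively is what the abstract's threading strategy addresses.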

ScholarWorks@UMass Amherst

Solution of partial differential equations on vector and parallel computers

Author: Ortega J. M.
Voigt R. G.

The present status of numerical methods for partial differential equations on vector and parallel computers is reviewed. The relevant aspects of these computers are discussed, and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations, as well as explicit and implicit methods for initial-boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.
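As an illustration of the kind of iterative method that suits vector and parallel hardware (chosen here as an example, not taken from the review), the sketch below applies red-black Gauss-Seidel to the 1D model problem -u'' = f on (0, 1) with u(0) = u(1) = 0, discretized by the standard three-point stencil. Within each half-sweep every update reads only points of the other color, so all of them can be performed simultaneously. The function name is hypothetical.

```python
from typing import List


def red_black_gauss_seidel(f: List[float], h: float, sweeps: int) -> List[float]:
    """Approximately solve -u'' = f on (0, 1) with u(0) = u(1) = 0.

    f holds values at the interior points x_1..x_n, with h = 1 / (n + 1).
    Discrete equations: (-u[i-1] + 2 u[i] - u[i+1]) / h^2 = f[i].
    """
    n = len(f)
    u = [0.0] * (n + 2)                # u[0] and u[n+1] are the zero boundary values
    for _ in range(sweeps):
        for start in (1, 2):           # "red" points (odd index), then "black" (even index)
            # Each update in this half-sweep reads only neighbors of the other color,
            # so these updates are mutually independent (vectorizable/parallel).
            for i in range(start, n + 1, 2):
                u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i - 1])
    return u[1:n + 1]
```

With f = 2 and h = 0.1 (nine interior points), the iteration converges to x(1 - x) at the grid points, because the three-point stencil is exact for quadratics.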

NASA Technical Reports Server