104 research outputs found

    A New Method for Efficient Parallel Solution of Large Linear Systems on a SIMD Processor.

    This dissertation proposes a new technique for the efficient parallel solution of very large linear systems of equations on a SIMD processor. The model problem used to investigate both the efficiency and the applicability of the technique had a regular structure with semi-bandwidth β and resulted from the approximation of a second-order, two-dimensional elliptic equation on a regular domain under Dirichlet and periodic boundary conditions. With only slight modifications, chiefly to account properly for the mathematical effects of varying bandwidths, the technique can be extended to the solution of any regular, banded system. The computational platform was the MasPar MP-X (model 1208B), a massively parallel processor with the hostname hurricane, housed in the Concurrent Computing Laboratory of the Physics/Astronomy department, Louisiana State University. The maximum bandwidth for which the problem's size fit the nyproc × nxproc machine array exactly was determined. This size and smaller ones were used in four experiments to evaluate the efficiency of the new technique. Four benchmark algorithms, two direct (Gauss elimination (GE) and orthogonal factorization) and two iterative (symmetric over-relaxation (SOR) with ω = 2 and the conjugate gradient method (CG)), were used to test the efficiency of the new approach against three evaluation metrics: the deviation of the computed results from the exact solution, measured as average absolute error; the CPU times; and the megaflop rates of execution. All the benchmarks except GE were implemented in parallel. In all evaluation categories, the new approach outperformed the benchmarks, most markedly when N ≫ p, where p is the number of processors and N the problem size. At the maximum system size, the new method was about 2.19 times more accurate and about 1.7 times faster than the benchmarks. However, when the system size was much smaller than the machine size, the new approach's performance deteriorated precipitously and was in fact worse than that of GE, the serial code. Hence, this technique is recommended for the solution of linear systems with regular structures on array processors when the problem's size is large in relation to the processor's size.

    MPI-CUDA parallel linear solvers for block-tridiagonal matrices in the context of SLEPc's eigensolvers

    We consider the computation of a few eigenpairs of a generalized eigenvalue problem Ax = λBx with block-tridiagonal matrices, not necessarily symmetric, in the context of Krylov methods. In this kind of computation, it is often necessary to solve a linear system of equations in each iteration of the eigensolver, for instance when B is not the identity matrix or when computing interior eigenvalues with the shift-and-invert spectral transformation. In this work, we compare different direct linear solvers that can exploit the block-tridiagonal structure. Block cyclic reduction and the Spike algorithm are considered. A parallel implementation based on MPI is developed in the context of the SLEPc library. The use of GPU devices to accelerate local computations proves to be competitive for large block sizes. This work was supported by Agencia Estatal de Investigación (AEI) under grant TIN2016-75985-P, which includes European Commission ERDF funds. Alejandro Lamas Daviña was supported by the Spanish Ministry of Education, Culture and Sport through a grant with reference FPU13-06655. Lamas Daviña, A.; Roman, JE. (2018). MPI-CUDA parallel linear solvers for block-tridiagonal matrices in the context of SLEPc's eigensolvers. Parallel Computing. 74:118-135. https://doi.org/10.1016/j.parco.2017.11.006

    Solution of partial differential equations on vector and parallel computers

    The present status of numerical methods for partial differential equations on vector and parallel computers is reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations, as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed.

    Algebraic, Block and Multiplicative Preconditioners based on Fast Tridiagonal Solves on GPUs

    This thesis contributes to the fields of sparse linear algebra, graph applications, and preconditioners for Krylov iterative solvers of sparse linear equation systems by providing a (block) tridiagonal solver library, a generalized sparse matrix-vector implementation, a linear forest extraction, and a multiplicative preconditioner based on tridiagonal solves. The tridiagonal library, which supports (scaled) partial pivoting, outperforms cuSPARSE's tridiagonal solver by a factor of five while completely utilizing the available GPU memory bandwidth. For performance-optimized solving of multiple right-hand sides, the explicit factorization of the tridiagonal matrix can be computed. The extraction of a weighted linear forest (a union of disjoint paths) from a general graph is used to build algebraic (block) tridiagonal preconditioners and employs the generalized sparse matrix-vector implementation of this thesis for preconditioner construction. During linear forest extraction, a new parallel bidirectional scan pattern, which can operate on doubly linked list structures, identifies the path ID and the position of a vertex. The algebraic preconditioner construction is also used to build more advanced preconditioners, which contain multiple tridiagonal factors, based on generalized ILU factorizations. Additionally, other preconditioners based on tridiagonal factors are presented and evaluated in comparison to ILU and ILU incomplete sparse approximate inverse (ILU-ISAI) preconditioners for the solution of large sparse linear equation systems from the Sparse Matrix Collection. For all problems presented in this thesis, an efficient parallel algorithm and its CUDA implementation for single-GPU systems are provided.
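    For readers unfamiliar with the building block this abstract relies on, the following is a minimal serial sketch of a tridiagonal solve (the classic Thomas algorithm). It is not the thesis's GPU library: it runs on one core, does no pivoting (so it assumes a well-conditioned system, for example a diagonally dominant one), and the function name `thomas_solve` and its argument layout are assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm (no pivoting).

    All four lists have length n. lower[i] is the sub-diagonal entry of
    row i (lower[0] is ignored); upper[i] is the super-diagonal entry of
    row i (upper[n-1] is ignored). Hypothetical helper for illustration.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination: remove the sub-diagonal row by row.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution from the last row upwards.
    x = [0.0] * n
    x[n - 1] = d[n - 1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


if __name__ == "__main__":
    # 4x4 system with 2 on the diagonal and -1 off the diagonal.
    n = 4
    x = thomas_solve([-1.0] * n, [2.0] * n, [-1.0] * n, [0.0, 0.0, 0.0, 5.0])
    print(x)
```

    Both loops carry a dependency from one row to the next, which is why GPU solvers typically turn to reformulations such as cyclic reduction or partitioned schemes rather than this recurrence.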

    A bibliography on parallel and vector numerical algorithms

    This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.

    Parallel prefix operations on heterogeneous platforms

    Official Doctoral Programme in Information Technology Research. 524V01. [Abstract] Graphics Processing Units (GPUs) have shown remarkable advantages in computing performance and energy efficiency, representing one of the most promising trends for the near future of high-performance computing (HPC). However, these devices also bring some programming complexities, and considerable effort is required to provide portability between different generations. Additionally, parallel prefix algorithms are a set of regular and highly used parallel algorithms whose efficiency is crucial in many computer science applications. Although GPUs can accelerate the computation of such algorithms, they can also be a limitation when the algorithms are not matched correctly to the GPU architecture or do not exploit its parallelism properly. This dissertation presents two different perspectives. On the one hand, new parallel prefix algorithms have been designed for any parallel programming paradigm. On the other hand, a general GPU tuning methodology is proposed to provide an easy and portable mechanism to efficiently implement parallel prefix algorithms on any CUDA GPU architecture, rather than focusing on a particular algorithm or a GPU model. To accomplish this goal, the methodology identifies the GPU parameters that influence performance and, following a set of performance premises, obtains suitable values of these parameters depending on the algorithm, the problem size and the GPU architecture. Additionally, the provided GPU functions are composed of modular and reusable CUDA blocks of code, which allow the easy implementation of any parallel prefix algorithm. Depending on the size of the dataset, three different approaches are proposed. The first two approaches solve small and medium-large datasets on a single GPU, whereas the third approach deals with extremely large datasets in a multi-GPU environment. Our proposals provide very competitive performance, outperforming the state of the art for many parallel prefix operations, such as the scan primitive, sorting and solving tridiagonal systems.
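    To make the scan primitive named in this abstract concrete, here is a small sketch of an inclusive prefix scan in the step-doubling (Hillis-Steele) style, simulated serially in Python. It is not the dissertation's CUDA code or tuning methodology; the function name `inclusive_scan` is an assumption, and each list comprehension merely stands in for one step that a GPU would execute across all elements at once.

```python
import operator


def inclusive_scan(values, op=operator.add):
    """Inclusive prefix scan using step doubling (Hillis-Steele style).

    `op` must be associative. Each pass of the while loop models one
    parallel step: every element i >= offset combines itself with the
    element `offset` positions to its left, reading only from the
    previous buffer. Hypothetical helper for illustration.
    """
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        prev = data  # buffer from the previous step (read-only here)
        data = [
            prev[i] if i < offset else op(prev[i - offset], prev[i])
            for i in range(n)
        ]
        offset *= 2
    return data


if __name__ == "__main__":
    print(inclusive_scan([3, 1, 4, 1, 5]))           # running sums
    print(inclusive_scan([3, 1, 4, 1, 5], op=max))   # running maxima
```

    The loop runs about log2(n) steps, at the cost of more total operations than a serial running sum; that trade-off between step count and work is the kind of choice that depends on problem size and hardware.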

    The method of polarized traces for the 2D Helmholtz equation

    We present a solver for the 2D high-frequency Helmholtz equation in heterogeneous acoustic media, with online parallel complexity that scales optimally as O(N/L), where N is the number of volume unknowns and L is the number of processors, as long as L grows at most like a small fractional power of N. The solver decomposes the domain into layers and uses transmission conditions in boundary integral form to explicitly define "polarized traces", i.e., up- and down-going waves sampled at interfaces. Local direct solvers are used in each layer to precompute traces of local Green's functions in an embarrassingly parallel way (the offline part), and incomplete Green's formulas are used to propagate interface data in a sweeping fashion, as a preconditioner inside a GMRES loop (the online part). Adaptive low-rank partitioning of the integral kernels is used to speed up their application to interface data. The method uses second-order finite differences. The complexity scalings are empirical but motivated by an analysis of ranks of off-diagonal blocks of oscillatory integrals. They continue to hold in the context of standard geophysical community models such as BP and Marmousi 2, where convergence occurs in 5 to 10 GMRES iterations. While the parallelism in this paper stems from decomposing the domain, we do not explore the alternative of parallelizing the system solves with distributed linear algebra routines. Keywords: Domain decomposition; Helmholtz equation; Integral equations; High-frequency; Fast methods. United States. Air Force Office of Scientific Research (Grant FA9550-15-1-0078). United States. Office of Naval Research (Grant N00014-13-1-0403). National Science Foundation (U.S.) (Grant DMS-1255203).