136 research outputs found
On the impact of communication complexity in the design of parallel numerical algorithms
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model generalizes an approach due to Hockney and is suitable for shared-memory multiprocessors in which each processor has vector capabilities. The other model applies to highly parallel non-shared-memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques from VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
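For orientation, the approach of Hockney that the first model generalizes is usually stated as a two-parameter timing formula for a vector operation of length $n$. The block below is that standard textbook form, not the generalized model of the paper; the symbols $r_\infty$ (asymptotic rate) and $n_{1/2}$ (half-performance length) follow Hockney's usual notation and do not appear in the abstract itself.

```latex
% Time for a vector operation on n elements (Hockney's characterization):
%   r_\infty : asymptotic rate, in results per unit time, as n -> infinity
%   n_{1/2}  : vector length at which half of r_\infty is achieved
t(n) = \frac{n + n_{1/2}}{r_\infty}
     = t_0 + \frac{n}{r_\infty},
\qquad t_0 = \frac{n_{1/2}}{r_\infty}

% Achieved rate as a function of vector length:
r(n) = \frac{n}{t(n)} = \frac{r_\infty}{1 + n_{1/2}/n}
```

Setting $n = n_{1/2}$ gives $r = r_\infty / 2$, which is where the name "half-performance length" comes from.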
A bibliography on parallel and vector numerical algorithms
This is a bibliography of numerical methods. It also includes a number of other references on machine architecture, programming languages, and other topics of interest to scientific computing. Certain conference proceedings and anthologies that have been published in book form are also listed.
Algebraic, Block and Multiplicative Preconditioners based on Fast Tridiagonal Solves on GPUs
This thesis contributes to the fields of sparse linear algebra, graph applications, and preconditioners for Krylov iterative solvers of sparse linear equation systems by providing a (block) tridiagonal solver library, a generalized sparse matrix-vector implementation, a linear forest extraction, and a multiplicative preconditioner based on tridiagonal solves. The tridiagonal library, which supports (scaled) partial pivoting, outperforms cuSPARSE's tridiagonal solver by a factor of five while completely utilizing the available GPU memory bandwidth. For the performance-optimized solution of systems with multiple right-hand sides, the explicit factorization of the tridiagonal matrix can be computed. The extraction of a weighted linear forest (a union of disjoint paths) from a general graph is used to build algebraic (block) tridiagonal preconditioners, and it deploys the generalized sparse matrix-vector implementation of this thesis for preconditioner construction. During linear forest extraction, a new parallel bidirectional scan pattern, which can operate on doubly linked list structures, identifies the path ID and the position of a vertex. The algebraic preconditioner construction is also used to build more advanced preconditioners, which contain multiple tridiagonal factors, based on generalized ILU factorizations. Additionally, other preconditioners based on tridiagonal factors are presented and evaluated in comparison to ILU and ILU incomplete sparse approximate inverse preconditioners (ILU-ISAI) for the solution of large sparse linear equation systems from the Sparse Matrix Collection. For every problem presented in this thesis, an efficient parallel algorithm and its CUDA implementation for single-GPU systems are provided.
Complexity Issues on Designing Tridiagonal Solvers on 2-Dimensional Mesh Interconnection Networks
We consider the problem of designing optimal and efficient algorithms for solving tridiagonal linear systems with multiple right-hand side vectors on two-dimensional mesh interconnection networks. We derive asymptotic upper and lower bounds for these solvers using odd-even cyclic reduction. We present several important lower bounds on the execution time for solving these systems, including general lower bounds that are independent of the initial data assignment, and lower bounds based on a classification of initial data assignments according to the proportion of initial data assigned amongst processors. Finally, different algorithms are designed to achieve running times that are within a small constant factor of the lower bounds provided.
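As background for the abstract above, odd-even cyclic reduction can be sketched as follows. This is a minimal serial sketch for a single right-hand side, not the mesh algorithm of the paper: the function name `cyclic_reduction` and the array layout are assumptions made for illustration, and no pivoting is done, so the matrix is assumed to be diagonally dominant (or otherwise safe to eliminate without pivoting).

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system by odd-even cyclic reduction.

    Row i reads a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i].
    a[0] and c[-1] are ignored. No pivoting is performed.
    """
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    a = [0.0] + list(a[1:])    # force the out-of-range couplings to zero
    c = list(c[:-1]) + [0.0]

    # Reduction: eliminate the even-indexed unknowns from every odd row.
    # The iterations are independent, which is the source of parallelism.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        alpha = a[i] / b[i - 1]
        na = -alpha * a[i - 1]
        nb = b[i] - alpha * c[i - 1]
        nc = 0.0
        nd = d[i] - alpha * d[i - 1]
        if i + 1 < n:
            gamma = c[i] / b[i + 1]
            nb -= gamma * a[i + 1]
            nc = -gamma * c[i + 1]
            nd -= gamma * d[i + 1]
        ra.append(na); rb.append(nb); rc.append(nc); rd.append(nd)

    # The odd-indexed unknowns satisfy a tridiagonal system of half the size.
    x = [0.0] * n
    x[1::2] = cyclic_reduction(ra, rb, rc, rd)

    # Back-substitution: each even unknown follows from its odd neighbours.
    for i in range(0, n, 2):
        s = d[i]
        if i > 0:
            s -= a[i] * x[i - 1]
        if i + 1 < n:
            s -= c[i] * x[i + 1]
        x[i] = s / b[i]
    return x
```

Each level halves the number of unknowns, so there are about log2(n) reduction levels; on a parallel machine the cost of each level is dominated by fetching the neighbouring rows, which is where the interconnection network enters the analysis.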
Towards efficient exploitation of GPUs: a methodology for mapping index-digit algorithms
[Abstract] GPU computing represented a major step forward, bringing high-performance computing to commodity hardware. Feature-rich parallel languages such as CUDA and OpenCL reduced the programming complexity. However, to take full advantage of the computing power of GPUs, specialized parallel algorithms are required. Moreover, the complex GPU memory hierarchy and highly threaded architecture make programming a difficult task even for experienced programmers. Due to the novelty of GPU programming, common general-purpose libraries are scarce and parallel versions of the algorithms are not always readily available.
Instead of focusing on the parallelization of particular algorithms, in this thesis we propose a general methodology applicable to most divide-and-conquer problems with a butterfly structure that can be formulated through the Index-Digit representation. First, we analyze the different performance factors of the GPU architecture. Next, we study several optimization techniques and design a series of modular and reusable building blocks, which are used to create the different algorithms. Finally, we study the optimal resource balance and, through a mapping vector representation and operator algebra, we tune the algorithms for the desired configurations. Despite the focus on programmability and flexibility, the resulting implementations offer very competitive performance and are able to surpass other well-known state-of-the-art libraries.
Stable Sparse Orthogonal Factorization of Ill-Conditioned Banded Matrices for Parallel Computing
Sequential and parallel algorithms based on the LU factorization or the QR factorization have been intensely studied and widely used in computations with large-scale ill-conditioned banded matrices. Major concerns with existing methods include ill-conditioning, sparsity of the factor matrices, computational complexity, and scalability. In this dissertation, we study a sparse orthogonal factorization of a banded matrix motivated by parallel computing. Specifically, we develop a process to factorize a banded matrix as a product of a sparse orthogonal matrix and a sparse matrix that can be transformed to an upper triangular matrix by column permutations. We prove that the proposed process has low complexity and is numerically stable, maintaining stability results similar to those of the modified Gram-Schmidt process. On this basis, we develop a parallel algorithm for the factorization in a distributed computing environment. Through an analysis of its performance, we show that the communication costs reach the theoretical least upper bounds, while its parallel complexity or speedup approaches the optimal bound. For an ill-conditioned banded system, we construct a sequential solver that breaks it down into small-scale underdetermined systems, which are solved by the proposed factorization with high accuracy. We also implement a parallel solver with strategies to treat the memory issue that appears in extra-large-scale linear systems of size over one billion. Numerical experiments confirm the theoretical results derived in this thesis and demonstrate the superior accuracy and scalability of the proposed solvers for ill-conditioned linear systems, compared to the most commonly used direct solvers.
Parallel prefix operations on heterogeneous platforms
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract] Graphics Processing Units (GPUs) have shown remarkable advantages in computing performance and energy efficiency, representing one of the most promising trends for the near future of high-performance computing. However, these devices also bring some programming complexities, and considerable effort is required to provide portability between different generations. Additionally, parallel prefix algorithms are a set of regular and highly used parallel algorithms whose efficiency is crucial in many computer science applications. Although GPUs can accelerate the computation of such algorithms, they can also be a limitation when they do not match the GPU architecture correctly or do not exploit the GPU parallelism properly.
This dissertation presents two different perspectives. On the one hand, new parallel prefix algorithms have been designed for any parallel programming paradigm. On the other hand, a general GPU tuning methodology is proposed to provide an easy and portable mechanism to efficiently implement parallel prefix algorithms on any CUDA GPU architecture, rather than focusing on a particular algorithm or GPU model. To accomplish this goal, the methodology identifies the GPU parameters that influence performance and, following a set of performance premises, obtains the convenient values of these parameters depending on the algorithm, the problem size and the GPU architecture. Additionally, the provided GPU functions are composed of modular and reusable CUDA blocks of code, which allow the easy implementation of any parallel prefix algorithm. Depending on the size of the dataset, three different approaches are proposed. The first two approaches solve small and medium-large datasets on a single GPU, whereas the third approach deals with extremely large datasets in a multi-GPU environment.
Our proposals provide very competitive performance, outperforming the state of the art for many parallel prefix operations, such as the scan primitive, sorting and solving tridiagonal systems.
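To make the scan primitive mentioned in this abstract concrete, the following is a minimal serial sketch of a step-doubling (Hillis-Steele style) inclusive scan. It is an illustration only, not the thesis's CUDA implementation; the function name `inclusive_scan` is an assumption, and each list comprehension stands in for one data-parallel step in which every element reads from the previous version of the array.

```python
import operator


def inclusive_scan(values, op=operator.add):
    """Inclusive prefix scan with an associative operator `op`.

    After the step with a given offset, x[i] holds the combination of the
    (up to) 2*offset inputs ending at position i.
    """
    x = list(values)
    n = len(x)
    offset = 1
    while offset < n:
        # One parallel step: element i combines with the element `offset`
        # positions to its left. Operand order is kept, so `op` only needs
        # to be associative, not commutative.
        x = [x[i] if i < offset else op(x[i - offset], x[i]) for i in range(n)]
        offset *= 2
    return x


print(inclusive_scan([3, 1, 4, 1, 5]))        # [3, 4, 8, 9, 14]
print(inclusive_scan([2, 7, 1, 8, 2], max))   # [2, 7, 7, 8, 8]
```

This variant takes about log2(n) steps but performs O(n log n) operations in total; work-efficient formulations trade extra steps for O(n) work, and choosing between such variants for a given problem size is the kind of tuning decision the abstract refers to.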
- …