Algorithmic patterns for H-matrices on many-core processors
In this work, we consider the reformulation of hierarchical (H)
matrix algorithms for many-core processors with a model implementation on
graphics processing units (GPUs). H-matrices approximate specific
dense matrices, e.g., from discretized integral equations or kernel ridge
regression, leading to log-linear time complexity in dense matrix-vector
products. The parallelization of H-matrix operations on many-core
processors is difficult due to the complex nature of the underlying algorithms.
While previous algorithmic advances for many-core hardware focused on
accelerating existing H-matrix CPU implementations with many-core
processors, we here aim to rely entirely on that processor type. As our main
contribution, we introduce the parallel algorithmic patterns needed
to map the full H-matrix construction and the fast matrix-vector
product to many-core hardware. The crucial ingredients are space-filling
curves, parallel tree traversal, and batching of linear algebra operations. To
the best of the authors' knowledge, the resulting model GPU implementation,
hmglib, is the first entirely GPU-based open-source H-matrix library of
this kind. We conclude this work with an in-depth performance analysis and a
comparative performance study against a standard H-matrix library,
highlighting profound speedups of our many-core parallel approach.
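As an illustration of the space-filling-curve ingredient, the following minimal sketch orders 2D points along a Z-order (Morton) curve, so that points close in space tend to end up close in the sorted array. The abstract does not say which curve hmglib uses, so the choice of the Morton curve, the function names `part1by1`, `morton2d`, and `morton_order`, and the 16-bit grid resolution are all assumptions made for this example.

```python
def part1by1(v: int) -> int:
    """Spread the lower 16 bits of v so a zero bit sits between each pair."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton2d(ix: int, iy: int) -> int:
    """Interleave the bits of two 16-bit grid indices (x in even bits)."""
    return part1by1(ix) | (part1by1(iy) << 1)


def morton_order(points, bits: int = 16):
    """Return the indices of points in [0, 1)^2 sorted along the Z-order curve.

    Hypothetical helper: the grid resolution `bits` (at most 16) is an
    assumption of this sketch.
    """
    scale = 1 << bits

    def key(i):
        x, y = points[i]
        # Quantize to the integer grid, clamping to the last cell.
        ix = min(int(x * scale), scale - 1)
        iy = min(int(y * scale), scale - 1)
        return morton2d(ix, iy)

    return sorted(range(len(points)), key=key)
```

Sorting by such a key turns a spatial clustering step into a sort over integer keys, which maps naturally to parallel sorting primitives on many-core hardware.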
An adaptive hierarchical domain decomposition method for parallel contact dynamics simulations of granular materials
A fully parallel version of the contact dynamics (CD) method is presented in
this paper. For large enough systems, 100% efficiency has been demonstrated for
up to 256 processors using a hierarchical domain decomposition with dynamic
load balancing. The iterative scheme to calculate the contact forces is left
domain-wise sequential, with data exchange after each iteration step, which
ensures its stability. The number of additional iterations required for
convergence by the partially parallel updates at the domain boundaries becomes
negligible with increasing number of particles, which allows for an effective
parallelization. Compared to the sequential implementation, we found no
influence of the parallelization on simulation results. Comment: 19 pages, 15 figures, published in Journal of Computational Physics
(2011)
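To make the idea of a hierarchical domain decomposition concrete, here is a minimal sketch based on recursive coordinate bisection: the particle set is split at the median along alternating axes, so every leaf domain holds roughly the same number of particles. This is a generic textbook technique chosen for illustration; the abstract does not describe the paper's actual decomposition or its dynamic rebalancing, and the function name `bisect_domains` is hypothetical.

```python
def bisect_domains(points, depth, axis=0):
    """Split points into up to 2**depth domains of near-equal size.

    Assumption of this sketch: load is proportional to the particle count,
    so a median split balances the work between the two halves.
    """
    if depth == 0 or len(points) <= 1:
        return [list(points)]
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    next_axis = (axis + 1) % len(pts[0])  # cycle through the coordinates
    return (bisect_domains(pts[:mid], depth - 1, next_axis)
            + bisect_domains(pts[mid:], depth - 1, next_axis))
```

Re-running the split when particles move would be one simple way to rebalance the domains, at the cost of a sort per level.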
A pattern language for parallelizing irregular algorithms
Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa to obtain the degree of Master in Informatics Engineering.
In irregular algorithms, the dependences and distributions of the data set cannot be statically predicted.
This class of algorithms tends to organize computations in terms of data locality instead of parallelizing control in multiple threads. Thus, opportunities for exploiting parallelism vary dynamically, according to how the algorithm changes data dependences. As such, effective parallelization of such algorithms requires new approaches that account for that dynamic nature.
This dissertation addresses the problem of building efficient parallel implementations of irregular algorithms by proposing to extract, analyze and document patterns of concurrency and parallelism present in the Galois parallelization framework for irregular algorithms.
Patterns capture formal representations of a tangible solution to a problem that arises in a well defined context within a specific domain.
We document these patterns in a pattern language, i.e., a set of inter-dependent patterns that compose well-documented template solutions that can be reused whenever a certain problem arises in a well-known context.
A Sparse SCF algorithm and its parallel implementation: Application to DFTB
We present an algorithm and its parallel implementation for solving a
self-consistent problem as encountered in Hartree-Fock or Density Functional Theory.
The algorithm takes advantage of the sparsity of matrices through the use of
local molecular orbitals. The implementation efficiently exploits
modern symmetric multiprocessing (SMP) computer architectures. As a first
application, the algorithm is used within the density-functional-based
tight-binding method, for which most of the computational time is spent in the linear
algebra routines (diagonalization of the Fock/Kohn-Sham matrix). We show that
with this algorithm (i) single-point calculations on very large systems
(millions of atoms) can be performed on large SMP machines, (ii) calculations
involving intermediate-size systems (1,000 to 100,000 atoms) are also strongly
accelerated and can run efficiently on standard servers, and (iii) the error in the
total energy due to the use of a cut-off in the molecular orbital coefficients
can be controlled such that it remains smaller than the SCF convergence
criterion. Comment: 13 pages, 11 figures
A Three-Level Parallelisation Scheme and Application to the Nelder-Mead Algorithm
We consider a three-level parallelisation scheme. The second and third levels
define a classical two-level parallelisation scheme, and a load-balancing
algorithm is used to distribute tasks among processes. It is well known that
for many applications the efficiency of parallel algorithms of the second and
third level starts to drop once some critical parallelisation degree is
reached. This weakness of the two-level template is addressed by introducing
one additional parallelisation level. As an alternative to the basic solver,
some new or modified algorithms are considered on this level. The idea of the
proposed methodology is to increase the parallelisation degree by using algorithms
that are less efficient than the basic solver. As an example we
investigate two modified Nelder-Mead methods. For the selected application, a
few partial differential equations are solved numerically on the second level,
and on the third level Wang's parallel algorithm is used to solve systems
of linear equations with tridiagonal matrices. A greedy workload balancing
heuristic is proposed, which is oriented to the case of a large number of
available processors. The complexity estimates of the computational tasks are
model-based, i.e. they use empirical computational data.
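A greedy workload balancing heuristic of the kind mentioned above could look like the following minimal sketch. It uses the classic longest-processing-time-first rule: tasks are sorted by estimated cost and each is given to the currently least-loaded processor. The abstract does not specify the paper's heuristic, so this rule, the function name `greedy_balance`, and the input format (a list of cost estimates) are assumptions made for illustration.

```python
import heapq


def greedy_balance(costs, n_procs):
    """Assign tasks to processors greedily, largest estimated cost first.

    Returns (assignment, loads): assignment[i] is the processor chosen for
    task i, and loads[p] is the total estimated cost placed on processor p.
    """
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = [None] * len(costs)
    loads = [0.0] * n_procs
    # Visit tasks in order of decreasing estimated cost.
    for task in sorted(range(len(costs)), key=lambda i: -costs[i]):
        load, proc = heapq.heappop(heap)  # least-loaded processor so far
        assignment[task] = proc
        loads[proc] = load + costs[task]
        heapq.heappush(heap, (loads[proc], proc))
    return assignment, loads
```

For example, `greedy_balance([5, 4, 3, 3, 3], 2)` places the tasks so that the two processors carry estimated loads of 8 and 10. The quality of the result depends on how accurate the cost estimates are.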