High-Performance Solvers for Dense Hermitian Eigenproblems
We introduce a new collection of solvers - subsequently called EleMRRR - for
large-scale dense Hermitian eigenproblems. EleMRRR solves various types of
problems: generalized, standard, and tridiagonal eigenproblems. Among these,
the last is of particular importance, as it is both a solver in its own right
and the computational kernel for the first two; we present a fast and
scalable tridiagonal solver based on the Algorithm of Multiple Relatively
Robust Representations - referred to as PMRRR. Like the other EleMRRR solvers,
PMRRR is part of the freely available Elemental library, and is designed to
fully support both message-passing (MPI) and multithreading parallelism (SMP).
As a result, the solvers can equally be used in pure MPI or in hybrid MPI-SMP
fashion. We conducted a thorough performance study of EleMRRR and ScaLAPACK's
solvers on two supercomputers. Such a study, performed with up to 8,192 cores,
provides precise guidelines to assemble the fastest solver within the ScaLAPACK
framework; it also indicates that EleMRRR outperforms even the fastest solvers
built from ScaLAPACK's components.
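EleMRRR itself ships with the Elemental library; purely as a minimal, hedged illustration of the tridiagonal MRRR kernel it builds on, the following Python sketch calls LAPACK's MRRR routine (stemr) through SciPy's eigh_tridiagonal. The random test matrix is an assumption for demonstration only, not a benchmark from the paper.

import numpy as np
from scipy.linalg import eigh_tridiagonal

# Symmetric tridiagonal matrix T: main diagonal d, off-diagonal e.
rng = np.random.default_rng(0)
n = 1000
d = rng.standard_normal(n)
e = rng.standard_normal(n - 1)

# LAPACK's 'stemr' driver implements the MRRR algorithm.
w, v = eigh_tridiagonal(d, e, lapack_driver='stemr')

# Verify the decomposition: || T v - v diag(w) || should be tiny.
T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
print(np.max(np.abs(T @ v - v * w)))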
Parallelization of the ADI method exploring vector computing in GPUs
Integrated master's dissertation in Informatics Engineering.
The 2D convection-diffusion equation is a well-known problem in scientific simulation that is
often solved with a direct method for a system of N linear equations, which requires N^3 operations.
This problem can be solved using a more efficient computational method, known as the
alternating direction implicit (ADI) method. It instead solves 2N tridiagonal systems of
N linear equations, each in O(N) operations, implemented in two steps: one solving row by
row, the other column by column. The N solves within each step are fully independent,
which opens an opportunity for an embarrassingly parallel solution; a minimal sketch of
one ADI step follows below. This method also exploits the way matrices are stored in
computer memory, either row-major or column-major, by splitting each iteration in two.
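As a hedged sketch of this two-step structure, the following NumPy/SciPy code performs one Peaceman-Rachford ADI step for the pure diffusion part only, assuming a square grid with unit spacing, zero Dirichlet boundaries, and a made-up coefficient r; the serial solve_banded call stands in for the batched tridiagonal solves the thesis vectorises and offloads to GPUs.

import numpy as np
from scipy.linalg import solve_banded

def half_step(u, r, ab):
    # Explicit second difference along axis 1, implicit tridiagonal solve
    # along axis 0: one independent solve per grid line, done in a batch.
    rhs = (1 - 2 * r) * u
    rhs[:, 1:] += r * u[:, :-1]
    rhs[:, :-1] += r * u[:, 1:]
    return solve_banded((1, 1), ab, rhs)

def adi_step(u, r, ab):
    u = half_step(u, r, ab)          # implicit in x, explicit in y
    return half_step(u.T, r, ab).T   # transposed: implicit in y, explicit in x

n = 256                              # square grid assumed
r = 0.25                             # dt * alpha / (2 * dx**2), made-up value
ab = np.zeros((3, n))                # (I + r*A) in LAPACK banded storage
ab[0, 1:] = -r                       # superdiagonal
ab[1, :] = 1 + 2 * r                 # main diagonal
ab[2, :-1] = -r                      # subdiagonal

u = np.zeros((n, n))
u[n // 2, n // 2] = 1.0              # point source as initial condition
u = adi_step(u, r, ab)               # one full ADI time step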
The major bottleneck of this method is solving the systems of linear equations. These
systems have tridiagonal matrices, since all non-zero elements lie on the three central
diagonals. Algorithms tailored for tridiagonal matrices can significantly improve
performance; they can be sequential (e.g. the Thomas algorithm, sketched below) or
parallel (e.g. cyclic reduction, CR, and parallel cyclic reduction, PCR).
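For reference, here is a minimal Python version of the sequential Thomas algorithm; the layout of the three diagonals as separate vectors is an assumption of this sketch.

import numpy as np

def thomas(a, b, c, d):
    # Solve a tridiagonal system with subdiagonal a (length n-1), main
    # diagonal b (length n), superdiagonal c (length n-1) and right-hand
    # side d, in O(n) operations. Inherently sequential: every step
    # depends on the previous one.
    n = len(b)
    cp = np.empty(n - 1)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):            # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):   # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Quick check against a dense solve.
n = 6
a = np.full(n - 1, -1.0)
b = np.full(n, 2.0)
c = np.full(n - 1, -1.0)
d = np.arange(n, dtype=float)
A = np.diag(b) + np.diag(a, -1) + np.diag(c, 1)
assert np.allclose(thomas(a, b, c, d), np.linalg.solve(A, d))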
Current vector extensions in conventional scalar processing units, such as x86-64 and
ARM devices, require the vector elements to be in contiguous memory locations to avoid
performance penalties. To overcome these limitations in dot products, several approaches
are proposed and evaluated in this work, both on general-purpose processing units and on
dedicated accelerators, namely NVidia GPUs.
Profiling the code execution on a server based on x86-64 devices showed that the ADI
method needs a combination of CPU compute power and memory transfer speed. This is
best shown on a server based on Intel's many-core KNL device, where the algorithm
scales until the memory bandwidth is no longer enough to feed all 64 computing cores. A
dual-socket server with 16-core Xeon Skylake processors and AVX-512 vector support
proved to be a better choice: the algorithm executes in less time and scales better.
Introducing GPU computing to further improve execution performance (together with other
optimisation techniques, namely a different thread scheme and the use of shared memory)
showed better results for larger grid sizes (above 32Ki x 32Ki). The CUDA development
environment also performed better than OpenCL in most cases. The largest difference
occurred with the hybrid CR-PCR method, where the OpenCL code showed a major performance
improvement over CUDA. Even so, the best average time for the ADI method across all
tested configurations on an NVidia GPU was obtained with CUDA on the most recent GPU
available (Pascal architecture), using CR as the auxiliary method.
Improved Accuracy and Parallelism for MRRR-based Eigensolvers -- A Mixed Precision Approach
The real symmetric tridiagonal eigenproblem is of outstanding importance in
numerical computations; it arises frequently as part of eigensolvers for
standard and generalized dense Hermitian eigenproblems that are based on a
reduction to tridiagonal form. For its solution, the algorithm of Multiple
Relatively Robust Representations (MRRR) is among the fastest methods. Although
fast, the solvers based on MRRR do not deliver the same accuracy as competing
methods like Divide & Conquer or the QR algorithm. In this paper, we
demonstrate that the use of mixed precisions leads to improved accuracy of
MRRR-based eigensolvers with limited or no performance penalty. As a result, we
obtain eigensolvers that are not only equally or more accurate than the best
available methods, but also - in most circumstances - faster and more scalable
than the competition.
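The paper's mixed-precision machinery lives inside MRRR itself; purely as a generic illustration of the principle, the sketch below solves a tridiagonal eigenproblem in single precision and then re-evaluates the eigenvalues as double-precision Rayleigh quotients of the single-precision eigenvectors, which sharpens well-separated eigenvalues. It assumes SciPy dispatches float32 input to the single-precision LAPACK routine; the test matrix and the refinement recipe are assumptions of this sketch, not the paper's algorithm.

import numpy as np
from scipy.linalg import eigh_tridiagonal

rng = np.random.default_rng(1)
n = 500
d = rng.standard_normal(n)
e = rng.standard_normal(n - 1)

# Solve in single precision (float32 input should select the
# single-precision LAPACK routine)...
w32, v32 = eigh_tridiagonal(d.astype(np.float32), e.astype(np.float32))

# ...then re-evaluate each eigenvalue as a double-precision Rayleigh
# quotient of the single-precision eigenvector: for well-separated
# eigenvalues the eigenvalue error drops roughly quadratically.
T = np.diag(d) + np.diag(e, 1) + np.diag(e, -1)
v = v32.astype(np.float64)
w_ref = np.einsum('ij,ij->j', v, T @ v) / np.einsum('ij,ij->j', v, v)

w64 = eigh_tridiagonal(d, e, eigvals_only=True)   # double-precision reference
print(np.max(np.abs(w32 - w64)), np.max(np.abs(w_ref - w64)))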
A Fast Solver for Large Tridiagonal Systems on Multi-Core Processors (Lass Library)
Many problems of industrial and scientific interest require solving tridiagonal linear systems. This paper presents several implementations for solving large tridiagonal systems in parallel on multi-core architectures, using the OmpSs programming model. The parallelization strategy combines two existing algorithms, PCR and Thomas. The Thomas algorithm, which cannot be parallelized, requires the fewest floating-point operations. The PCR algorithm is the most popular parallel method, but it is computationally more expensive than Thomas. The method proposed in this paper first applies PCR to break one large tridiagonal system down into a set of smaller, independent ones. In a second step, these independent systems are solved concurrently using Thomas; a sketch of this splitting appears below. The paper also contains an analytical study of the best point at which to switch from PCR to Thomas, and it addresses the main performance issues of combining PCR and Thomas by proposing a set of alternative implementations, some of which imply algorithmic changes. The performance evaluation shows that the best implementation achieves a peak speedup of 4 with respect to the Intel MKL counterpart routine and of 2.5 with respect to a single-threaded Thomas.
This work was supported in part by the European Union’s Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreements Human Brain Project SGA1 and SGA2 (Grants 720270 and 785907); in part by the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (Grant TIN2015-65316-P); in part by the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya under the project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (Grant 2014-SGR-1051); in part by a Juan de la Cierva grant (IJCI-2017-33511); in part by Fujitsu under the Barcelona Supercomputing Center-Fujitsu Joint Project: Math Libraries Migration and Optimization; in part by the Ministerio de Economía, Industria y Competitividad of Spain and the Fondo Europeo de Desarrollo Regional funds of the European Union (Grant TIN2016-75845-P); in part by the Xunta de Galicia, co-funded by the European Regional Development Fund (ERDF), under the Consolidation Programme of Competitive Reference Groups (Grant ED431C 2017/04); and in part by the Centro Singular de Investigación de Galicia accreditation 2016-2019 (Grant ED431G/01).
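As a serial NumPy sketch of this splitting idea (not the Lass/OmpSs implementation), the code below applies k steps of PCR, after which equations i and j interact only if i = j (mod 2^k), and then solves the resulting 2^k independent tridiagonal systems one by one; solve_banded stands in for the Thomas kernel here, and the diagonally dominant test system is an assumption.

import numpy as np
from scipy.linalg import solve_banded

def pcr_split(a, b, c, d, k):
    # k steps of parallel cyclic reduction. a[i] and c[i] couple x[i] to
    # x[i-s] and x[i+s]; after k steps (s = 2**k) equations i and j
    # interact only if i == j mod 2**k, so the system has split into
    # 2**k independent tridiagonal systems.
    n = len(b)
    s = 1
    for _ in range(k):
        na, nb = np.zeros_like(a), b.copy()
        nc, nd = np.zeros_like(c), d.copy()
        for i in range(n):
            if i - s >= 0:                     # eliminate x[i-s]
                alpha = -a[i] / b[i - s]
                na[i] = alpha * a[i - s]
                nb[i] += alpha * c[i - s]
                nd[i] += alpha * d[i - s]
            if i + s < n:                      # eliminate x[i+s]
                gamma = -c[i] / b[i + s]
                nc[i] = gamma * c[i + s]
                nb[i] += gamma * a[i + s]
                nd[i] += gamma * d[i + s]
        a, b, c, d = na, nb, nc, nd
        s *= 2
    return a, b, c, d

# Diagonally dominant test system: a[i], c[i] multiply x[i-1], x[i+1].
rng = np.random.default_rng(2)
n, k = 64, 3
b = 4.0 + rng.random(n)
a = rng.random(n); a[0] = 0.0
c = rng.random(n); c[-1] = 0.0
d = rng.random(n)

a2, b2, c2, d2 = pcr_split(a, b, c, d, k)
x = np.empty(n)
m = 2 ** k
for r in range(m):                 # independent subsystems, solved one by
    idx = np.arange(r, n, m)       # one here; concurrently (with Thomas)
    ab = np.zeros((3, len(idx)))   # in the paper
    ab[0, 1:] = c2[idx[:-1]]       # superdiagonal within the subsystem
    ab[1, :] = b2[idx]             # main diagonal
    ab[2, :-1] = a2[idx[1:]]       # subdiagonal
    x[idx] = solve_banded((1, 1), ab, d2[idx])

A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(A @ x, d)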
Solving Dense Generalized Eigenproblems on Multi-threaded Architectures
We compare two approaches to compute a fraction of the spectrum of dense symmetric definite generalized eigenproblems: one is based on the reduction to tridiagonal form, the other on Krylov-subspace iteration. Two large-scale applications, arising in molecular dynamics and materials science, are employed to investigate the contributions of the application, the architecture, and the parallelism of the method to the performance of the solvers. The experimental results on a state-of-the-art 8-core platform, equipped with a graphics processing unit (GPU), reveal that in realistic applications iterative Krylov-subspace methods can be a competitive approach also for the solution of dense problems.
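As a toy-scale illustration of the two routes (not the paper's GPU-accelerated solvers), the sketch below computes the lowest eigenpairs of an assumed random symmetric definite pencil both via dense reduction (scipy.linalg.eigh with subset_by_index) and via shift-inverted Lanczos iteration (scipy.sparse.linalg.eigsh).

import numpy as np
from scipy.linalg import eigh
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(3)
n, k = 800, 10
A = rng.standard_normal((n, n))
A = A @ A.T + np.eye(n)                 # symmetric positive definite
B = rng.standard_normal((n, n))
B = B @ B.T + n * np.eye(n)             # symmetric positive definite

# Direct route: reduction-based dense solver, lowest k eigenpairs only.
w_dir, v_dir = eigh(A, B, subset_by_index=[0, k - 1])

# Iterative route: shift-inverted Lanczos for the same part of the spectrum.
w_it, v_it = eigsh(A, k=k, M=B, sigma=0, which='LM')

print(np.allclose(np.sort(w_dir), np.sort(w_it)))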
MRRR-based Eigensolvers for Multi-core Processors and Supercomputers
The real symmetric tridiagonal eigenproblem is of outstanding importance in
numerical computations; it arises frequently as part of eigensolvers for
standard and generalized dense Hermitian eigenproblems that are based on a
reduction to tridiagonal form. For its solution, the algorithm of Multiple
Relatively Robust Representations (MRRR, or MR3 for short) - introduced in the
late 1990s - is among the fastest methods. To compute k eigenpairs of a real
n-by-n tridiagonal T, MRRR only requires O(kn) arithmetic operations; in
contrast, all the other practical methods require O(k^2 n) or O(n^3) operations
in the worst case. This thesis centers on the performance and accuracy of MRRR.
Comment: PhD thesis.
A Three-Level Parallelisation Scheme and Application to the Nelder-Mead Algorithm
We consider a three-level parallelisation scheme. The second and third levels
define a classical two-level parallelisation scheme, in which some load balancing
algorithm is used to distribute tasks among processes. It is well known that for
many applications the efficiency of parallel algorithms at the second and third
levels starts to drop once some critical parallelisation degree is reached. This
weakness of the two-level template is addressed by introducing one additional
parallelisation level. As an alternative to the basic solver, some new or
modified algorithms are considered at this level. The idea of the proposed
methodology is to increase the parallelisation degree by using algorithms that
are less efficient than the basic solver. As an example we investigate two
modified Nelder-Mead methods. For the selected application, a few partial
differential equations are solved numerically at the second level, and at the
third level Wang's parallel algorithm is used to solve systems of linear
equations with tridiagonal matrices. A greedy workload balancing heuristic is
proposed, oriented towards the case of a large number of available processors.
The complexity estimates of the computational tasks are model-based, i.e. they
use empirical computational data.
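As a toy illustration of the outermost parallelisation level only, the sketch below runs several independent Nelder-Mead instances concurrently; the Rosenbrock function is a hypothetical stand-in for the PDE-based objective, and the proposed load balancing heuristic is not reproduced.

import numpy as np
from concurrent.futures import ProcessPoolExecutor
from scipy.optimize import minimize

def objective(x):
    # Rosenbrock function: a hypothetical stand-in for an objective whose
    # every evaluation requires numerically solving PDEs.
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2
                        + (1 - x[:-1]) ** 2))

def run_nm(x0):
    return minimize(objective, x0, method='Nelder-Mead').x

if __name__ == '__main__':
    rng = np.random.default_rng(4)
    starts = [rng.uniform(-2, 2, size=4) for _ in range(8)]
    # Outermost parallelisation level: independent Nelder-Mead instances.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_nm, starts))
    print(min(results, key=objective))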
Scalable many-core algorithms for tridiagonal solvers
We present a novel distributed-memory tridiagonal solver library, targeting large-scale systems based on modern multi-core and many-core processor architectures. The library uses methods based on both approximate and exact algorithms. Performance comparisons with the state of the art, using both a large Cray EX system and a GPU cluster, show the algorithmic trade-offs required at increasing machine scale to achieve good performance, particularly considering the advent of exascale systems.
Efficient Multigrid Preconditioners for Atmospheric Flow Simulations at High Aspect Ratio
Many problems in fluid modelling require the efficient solution of highly
anisotropic elliptic partial differential equations (PDEs) in "flat" domains.
For example, in numerical weather- and climate-prediction an elliptic PDE for
the pressure correction has to be solved at every time step in a thin spherical
shell representing the global atmosphere. This elliptic solve can be one of the
computationally most demanding components in semi-implicit semi-Lagrangian time
stepping methods which are very popular as they allow for larger model time
steps and better overall performance. With increasing model resolution,
algorithmically efficient and scalable algorithms are essential to run the code
under tight operational time constraints. We discuss the theory and practical
application of bespoke geometric multigrid preconditioners for equations of
this type. The algorithms deal with the strong anisotropy in the vertical
direction by using the tensor-product approach originally analysed by Börm
and Hiptmair [Numer. Algorithms, 26/3 (2001), pp. 219-234]. We extend the
analysis to three dimensions under slightly weakened assumptions, and
numerically demonstrate its efficiency for the solution of the elliptic PDE for
the global pressure correction in atmospheric forecast models. For this we
compare the performance of different multigrid preconditioners on a
tensor-product grid with a semi-structured and quasi-uniform horizontal mesh
and a one dimensional vertical grid. The code is implemented in the Distributed
and Unified Numerics Environment (DUNE), which provides an easy-to-use and
scalable environment for algorithms operating on tensor-product grids. Parallel
scalability of our solvers on up to 20,480 cores is demonstrated on the HECToR
supercomputer.
Comment: 22 pages, 6 figures, 2 tables.
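The tensor-product approach relies on relaxation that is exact in the strongly coupled vertical direction; as a hedged sketch of such a vertical line smoother (not the DUNE implementation), the code below applies damped block Jacobi with one tridiagonal solve per vertical column to a toy anisotropic model problem, where the small parameter eps mimics the high aspect ratio.

import numpy as np
from scipy.linalg import solve_banded

def vertical_line_jacobi(u, f, eps, nsweeps=2, omega=0.8):
    # Damped block-Jacobi smoother for -eps*u_xx - u_zz = f on a unit-spaced
    # grid with zero Dirichlet boundaries. Each vertical column is solved
    # exactly with a tridiagonal solve, which keeps the smoother robust as
    # the vertical coupling dominates (eps << 1 mimics a high aspect ratio).
    nx, nz = u.shape
    ab = np.zeros((3, nz))              # vertical operator in banded form
    ab[0, 1:] = -1.0
    ab[1, :] = 2.0 + 2.0 * eps
    ab[2, :-1] = -1.0
    for _ in range(nsweeps):
        # Horizontal neighbours go to the right-hand side...
        rhs = f + eps * (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0))
        rhs[0] = f[0] + eps * u[1]      # undo the wrap-around of np.roll
        rhs[-1] = f[-1] + eps * u[-2]
        # ...and all vertical columns are solved in one banded call.
        u_new = solve_banded((1, 1), ab, rhs.T).T
        u = (1 - omega) * u + omega * u_new
    return u

nx = nz = 64
u = np.random.default_rng(5).random((nx, nz))   # rough initial error
f = np.zeros((nx, nz))
u = vertical_line_jacobi(u, f, eps=1e-3)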