242 research outputs found
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The tasks graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014
Linear solvers for power grid optimization problems: a review of GPU-accelerated linear solvers
The linear equations that arise in interior methods for constrained
optimization are sparse symmetric indefinite and become extremely
ill-conditioned as the interior method converges. These linear systems present
a challenge for existing solver frameworks based on sparse LU or LDL^T
decompositions. We benchmark five well known direct linear solver packages
using matrices extracted from power grid optimization problems. The achieved
solution accuracy varies greatly among the packages. None of the tested
packages delivers significant GPU acceleration for our test cases
GPU-resident sparse direct linear solvers for alternating current optimal power flow analysis
Integrating renewable resources within the transmission grid at a wide scale poses significant challenges for economic dispatch as it requires analysis with more optimization parameters, constraints, and sources of uncertainty. This motivates the investigation of more efficient computational methods, especially those for solving the underlying linear systems, which typically take more than half of the overall computation time. In this paper, we present our work on sparse linear solvers that take advantage of hardware accelerators, such as graphical processing units (GPUs), and improve the overall performance when used within economic dispatch computations. We treat the problems as sparse, which allows for faster execution but also makes the implementation of numerical methods more challenging. We present the first GPU-native sparse direct solver that can execute on both AMD and NVIDIA GPUs. We demonstrate significant performance improvements when using high-performance linear solvers within alternating current optimal power flow (ACOPF) analysis. Furthermore, we demonstrate the feasibility of getting significant performance improvements by executing the entire computation on GPU-based hardware. Finally, we identify outstanding research issues and opportunities for even better utilization of heterogeneous systems, including those equipped with GPUs
An efficient GPU version of the preconditioned GMRES method
[EN] In a large number of scientific applications, the solution of sparse linear systems is the stage that concentrates most of the computational effort. This situation has motivated the study and development of several iterative solvers, among which preconditioned Krylov subspace methods occupy a place of privilege. In a previous effort, we developed a GPU-aware version of the GMRES method included in ILUPACK, a package of solvers distinguished by its inverse-based multilevel ILU preconditioner. In this work, we study the performance of our previous proposal and integrate several enhancements in order to mitigate its principal bottlenecks. The numerical evaluation shows that our novel proposal can reach important run-time reductions.Aliaga, JI.; Dufrechou, E.; Ezzatti, P.; Quintana-Orti, ES. (2019). An efficient GPU version of the preconditioned GMRES method. The Journal of Supercomputing. 75(3):1455-1469. https://doi.org/10.1007/s11227-018-2658-1S14551469753Aliaga JI, Badia RM, Barreda M, Bollhöfer M, Dufrechou E, Ezzatti P, Quintana-OrtĂ ES (2016) Exploiting task and data parallelism in ILUPACK’s preconditioned CG solver on NUMA architectures and many-core accelerators. Parallel Comput 54:97–107Aliaga JI, Bollhöfer M, Dufrechou E, Ezzatti P, Quintana-OrtĂ ES (2016) A data-parallel ILUPACK for sparse general and symmetric indefinite linear systems. In: Lecture Notes in Computer Science, 14th Int. Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms—HeteroPar’16. SpringerAliaga JI, Bollhöfer M, MartĂn AF, Quintana-OrtĂ ES (2011) Exploiting thread-level parallelism in the iterative solution of sparse linear systems. Parallel Comput 37(3):183–202Aliaga JI, Bollhöfer M, MartĂn AF, Quintana-OrtĂ ES (2012) Parallelization of multilevel ILU preconditioners on distributed-memory multiprocessors. Appl Parallel Sci Comput LNCS 7133:162–172Aliaga JI, Dufrechou E, Ezzatti P, Quintana-OrtĂ ES (2018) Accelerating a preconditioned GMRES method in massively parallel processors. In: CMMSE 2018: Proceedings of the 18th International Conference on Mathematical Methods in Science and Engineering (2018)Bollhöfer M, Grote MJ, Schenk O (2009) Algebraic multilevel preconditioner for the Helmholtz equation in heterogeneous media. SIAM J Sci Comput 31(5):3781–3805Bollhöfer M, Saad Y (2006) Multilevel preconditioners constructed from inverse-based ILUs. SIAM J Sci Comput 27(5):1627–1650Dufrechou E, Ezzatti P (2018) A new GPU algorithm to compute a level set-based analysis for the parallel solution of sparse triangular systems. In: 2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, Canada, 2018. IEEE Computer SocietyDufrechou E, Ezzatti P (2018) Solving sparse triangular linear systems in modern GPUs: a synchronization-free algorithm. In: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp 196–203. https://doi.org/10.1109/PDP2018.2018.00034Eijkhout V (1992) LAPACK working note 50: distributed sparse data structures for linear algebra operations. Tech. rep., Knoxville, TN, USAGolub GH, Van Loan CF (2013) Matrix computationsHe K, Tan SXD, Zhao H, Liu XX, Wang H, Shi G (2016) Parallel GMRES solver for fast analysis of large linear dynamic systems on GPU platforms. Integration 52:10–22 http://www.sciencedirect.com/science/article/pii/S016792601500084XLiu W, Li A, Hogg JD, Duff IS, Vinter B (2017) Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides. Concurr Comput 29(21)Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, PhiladelphiaSchenk O, Wächter A, Weiser M (2008) Inertia revealing preconditioning for large-scale nonconvex constrained optimization. SIAM J Sci Comput 31(2):939–96
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs
Many problems in geophysical and atmospheric modelling require the fast
solution of elliptic partial differential equations (PDEs) in "flat" three
dimensional geometries. In particular, an anisotropic elliptic PDE for the
pressure correction has to be solved at every time step in the dynamical core
of many numerical weather prediction models, and equations of a very similar
structure arise in global ocean models, subsurface flow simulations and gas and
oil reservoir modelling. The elliptic solve is often the bottleneck of the
forecast, and an algorithmically optimal method has to be used and implemented
efficiently. Graphics Processing Units have been shown to be highly efficient
for a wide range of applications in scientific computing, and recently
iterative solvers have been parallelised on these architectures. We describe
the GPU implementation and optimisation of a Preconditioned Conjugate Gradient
(PCG) algorithm for the solution of a three dimensional anisotropic elliptic
PDE for the pressure correction in NWP. Our implementation exploits the strong
vertical anisotropy of the elliptic operator in the construction of a suitable
preconditioner. As the algorithm is memory bound, performance can be improved
significantly by reducing the amount of global memory access. We achieve this
by using a matrix-free implementation which does not require explicit storage
of the matrix and instead recalculates the local stencil. Global memory access
can also be reduced by rewriting the algorithm using loop fusion and we show
that this further reduces the runtime on the GPU. We demonstrate the
performance of our matrix-free GPU code by comparing it to a sequential CPU
implementation and to a matrix-explicit GPU code which uses existing libraries.
The absolute performance of the algorithm for different problem sizes is
quantified in terms of floating point throughput and global memory bandwidth.Comment: 18 pages, 7 figure
An Experimental Study of Two-Level Schwarz Domain Decomposition Preconditioners on GPUs
The generalized Dryja--Smith--Widlund (GDSW) preconditioner is a two-level
overlapping Schwarz domain decomposition (DD) preconditioner that couples a
classical one-level overlapping Schwarz preconditioner with an
energy-minimizing coarse space. When used to accelerate the convergence rate of
Krylov subspace iterative methods, the GDSW preconditioner provides robustness
and scalability for the solution of sparse linear systems arising from the
discretization of a wide range of partial different equations. In this paper,
we present FROSch (Fast and Robust Schwarz), a domain decomposition solver
package which implements GDSW-type preconditioners for both CPU and GPU
clusters. To improve the solver performance on GPUs, we use a novel
decomposition to run multiple MPI processes on each GPU, reducing both solver's
computational and storage costs and potentially improving the convergence rate.
This allowed us to obtain competitive or faster performance using GPUs compared
to using CPUs alone. We demonstrate the performance of FROSch on the Summit
supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service
(MPS) to implement our decomposition strategy.
The solver has a wide variety of algorithmic and implementation choices,
which poses both opportunities and challenges for its GPU implementation. We
conduct a thorough experimental study with different solver options including
the exact or inexact solution of the local overlapping subdomain problems on a
GPU. We also discuss the effect of using the iterative variant of the
incomplete LU factorization and sparse-triangular solve as the approximate
local solver, and using lower precision for computing the whole FROSch
preconditioner. Overall, the solve time was reduced by factors of about
using GPUs, while the GPU acceleration of the numerical setup time
depend on the solver options and the local matrix sizes.Comment: Accepted for publication in IPDPS'2
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.Comment: 32 pages, 11 figure
Accelerating the task/data-parallel version of ILUPACKÂżs BiCG in multi-CPU/GPU configurations
[EN] ILUPACK is a valuable tool for the solution of sparse linear systems via iterative Krylov subspace-based methods. Its relevance for the solution of real problems has motivated several efforts to enhance its performance on parallel machines. In this work we focus on exploiting the task-level parallelism derived from the structure of the BiCG method, in addition to the data-level parallelism of the internal matrix computations, with the goal of boosting the performance of a GPU (graphics processing unit) implementation of this solver. First, we revisit the use of dual-GPU systems to execute independent stages of the BiCG concurrently on both accelerators, while leveraging the extra memory space to improve the data access patterns. In addition, we extend our ideas to compute the BiCG method efficiently in multicore platforms with a single GPU. In this line, we study the possibilities offered by hybrid CPU-GPU computations, as well as a novel synchronization-free sparse triangular linear solver. The experimental results with the new solvers show important acceleration factors with respect to the previous data-parallel CPU and GPU versions. (C) 2019 Elsevier B.V. All rights reserved.J. I. Aliaga and E. S. Quintana-Orti were supported by project TIN2017-82972-R of the MINECO and FEDER. E. Dufrechou and P. Ezzatti were supported by Programa de Desarrollo de las Ciencias Basicas (PEDECIBA), Uruguay.Aliaga, JI.; Dufrechou, E.; Ezzatti, P.; Quintana-OrtĂ, ES. (2019). Accelerating the task/data-parallel version of ILUPACKÂżs BiCG in multi-CPU/GPU configurations. Parallel Computing. 85:79-87. https://doi.org/10.1016/j.parco.2019.02.005S79878
PARALLEL ALGORITHMS FOR NONLINEAR PROGRAMMING AND APPLICATIONS IN PHARMACEUTICAL MANUFACTURING
Effective manufacturing of pharmaceuticals presents a number of challenging optimization problems due to complex distributed, time-independent models and the need to handle uncertainty. These challenges are multiplied when real-time solutions are required. The demand for fast solution of nonlinear optimization problems, coupled with the emergence of new concurrent computing architectures, drives the need for parallel algorithms to solve challenging NLP problems. The goal of this work is the development of parallel algorithms for nonlinear programming problems on different computing architectures, and the application of large-scale nonlinear programming on challenging problems in pharmaceutical manufacturing
- …