47 research outputs found
Out-of-core macromolecular simulations on multithreaded architectures
We address the solution of large-scale eigenvalue problems that appear in the motion simulation of complex macromolecules on multithreaded platforms, consisting of multicore processors and possibly a graphics processor (GPU). In particular, we compare specialized implementations of several high- performance eigensolvers that, by relying on disk storage and out-of-core (OOC) techniques, can in principle tackle the large memory requirements of these biological problems, which in general do not fit into the main memory of current desktop machines. All these OOC eigensolvers, except for one, are composed of compute-bound (i.e., arithmetically-intensive) operations, which we accelerate by exploiting the performance of current multicore processors and, in some cases, by additionally off-loading certain parts of the computation to a GPU accelerator. One of the eigensolvers is a memory-bound algorithm, which strongly constrains its performance when the data is on disk. However, this method exhibits a much lower arithmetic cost compared with its compute- bound alternatives for this particular application. Experimental results on a desktop platform, representative of current server technology, illustrate the potential of these methods to address the simulation of biological activity
Out-of-core solution of eigenproblems for macromolecular simulations
We consider the solution of large-scale eigenvalue problems that appear in the motion simulation of complex macromolecules on desktop platforms. To tackle the dimension of the matrices that are involved in these problems, we formulate out-of-core (OOC) variants of the two selected eigensolvers, that basically decouple the performance of the solver from the storage capacity. Furthermore, we contend with the high
computational complexity of the solvers by off-loading the arithmetically-intensive parts of the algorithms to a hardware graphics accelerator
GPU implementation of Krylov solvers for block-tridiagonal eigenvalue problems
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-32149-3_18In an eigenvalue problem defined by one or two matrices with block-tridiagonal structure, if only a few eigenpairs are required it is interesting to consider iterative methods based on Krylov subspaces, even if matrix blocks are dense. In this context, using the GPU for the associated dense linear algebra may provide high performance. We analyze this in an implementation done in the context of SLEPc, the Scalable Library for Eigenvalue Problem Computations. In the case of a generalized eigenproblem or when interior eigenvalues are computed with shift-and-invert, the main computational kernel is the solution of linear systems with a block-tridiagonal matrix. We explore possible implementations of this operation on the GPU, including a block cyclic reduction algorithm.This work was partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2013-41049-P. Alejandro Lamas was supported by the Spanish Ministry of Education, Culture and Sport through grant FPU13-06655.Lamas Daviña, A.; Román Moltó, JE. (2016). GPU implementation of Krylov solvers for block-tridiagonal eigenvalue problems. En Parallel Processing and Applied Mathematics. Springer. 182-191. https://doi.org/10.1007%2F978-3-319-32149-3_18S182191Baghapour, B., Esfahanian, V., Torabzadeh, M., Darian, H.M.: A discontinuous Galerkin method with block cyclic reduction solver for simulating compressible flows on GPUs. Int. J. Comput. Math. 92(1), 110–131 (2014)Bientinesi, P., Igual, F.D., Kressner, D., Petschow, M., Quintana-OrtÃ, E.S.: Condensed forms for the symmetric eigenvalue problem on multi-threaded architectures. Concur. Comput. Pract. Exp. 23, 694–707 (2011)Haidar, A., Ltaief, H., Dongarra, J.: Toward a high performance tile divide and conquer algorithm for the dense symmetric eigenvalue problem. SIAM J. Sci. Comput. 34(6), C249–C274 (2012)Heller, D.: Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems. SIAM J. Numer. Anal. 13(4), 484–496 (1976)Hernandez, V., Roman, J.E., Vidal, V.: SLEPc: a scalable and flexible toolkit for the solution of eigenvalue problems. ACM Trans. Math. Softw. 31(3), 351–362 (2005)Hirshman, S.P., Perumalla, K.S., Lynch, V.E., Sanchez, R.: BCYCLIC: a parallel block tridiagonal matrix cyclic solver. J. Comput. Phys. 229(18), 6392–6404 (2010)Minden, V., Smith, B., Knepley, M.G.: Preliminary implementation of PETSc using GPUs. In: Yuen, D.A., Wang, L., Chi, X., Johnsson, L., Ge, W., Shi, Y. (eds.) GPU Solutions to Multi-scale Problems in Science and Engineering. Lecture Notes in Earth System Sciences, pp. 131–140. Springer, Heidelberg (2013)NVIDIA: CUBLAS Library V7.0. Technical report, DU-06702-001 v7.0, NVIDIA Corporation (2015)Park, A.J., Perumalla, K.S.: Efficient heterogeneous execution on large multicore and accelerator platforms: case study using a block tridiagonal solver. J. Parallel and Distrib. Comput. 73(12), 1578–1591 (2013)Reguly, I., Giles, M.: Efficient sparse matrix-vector multiplication on cache-based GPUs. In: Innovative Parallel Computing (InPar), pp. 1–12 (2012)Roman, J.E., Vasconcelos, P.B.: Harnessing GPU power from high-level libraries: eigenvalues of integral operators with SLEPc. In: International Conference on Computational Science. Procedia Computer Science, vol. 18, pp. 2591–2594. Elsevier (2013)Seal, S.K., Perumalla, K.S., Hirshman, S.P.: Revisiting parallel cyclic reduction and parallel prefix-based algorithms for block tridiagonal systems of equations. J. Parallel Distrib. Comput. 73(2), 273–280 (2013)Stewart, G.W.: A Krylov-Schur algorithm for large eigenproblems. SIAM J. Matrix Anal. Appl. 23(3), 601–614 (2001)Tomov, S., Nath, R., Dongarra, J.: Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing. Parallel Comput. 36(12), 645–654 (2010)Vomel, C., Tomov, S., Dongarra, J.: Divide and conquer on hybrid GPU-accelerated multicore systems. SIAM J. Sci. Comput. 34(2), C70–C82 (2012)Zhang, Y., Cohen, J., Owens, J.D.: Fast tridiagonal solvers on the GPU. In: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. PPopp 2010, pp. 127–136 (2010
MPI-CUDA parallel linear solvers for block-tridiagonal matrices in the context of SLEPc's eigensolvers
[EN] We consider the computation of a few eigenpairs of a generalized eigenvalue problem Ax = lambda Bx with block-tridiagonal matrices, not necessarily symmetric, in the context of Krylov methods. In this kind of computation, it is often necessary to solve a linear system of equations in each iteration of the eigensolver, for instance when B is not the identity matrix or when computing interior eigenvalues with the shift-and-invert spectral transformation. In this work, we aim to compare different direct linear solvers that can exploit the block-tridiagonal structure. Block cyclic reduction and the Spike algorithm are considered. A parallel implementation based on MPI is developed in the context of the SLEPc library. The use of GPU devices to accelerate local computations shows to be competitive for large block sizes.This work was supported by Agencia Estatal de Investigacion (AEI) under grant TIN2016-75985-P, which includes European Commission ERDF funds. Alejandro Lamas Davina was supported by the Spanish Ministry of Education, Culture and Sport through a grant with reference FPU13-06655.Lamas Daviña, A.; Roman, JE. (2018). MPI-CUDA parallel linear solvers for block-tridiagonal matrices in the context of SLEPc's eigensolvers. Parallel Computing. 74:118-135. https://doi.org/10.1016/j.parco.2017.11.006S1181357
Dense and sparse parallel linear algebra algorithms on graphics processing units
Una lÃnea de desarrollo seguida en el campo de la supercomputación es el uso de procesadores de propósito especÃfico para acelerar determinados tipos de cálculo. En esta tesis estudiamos el uso de tarjetas gráficas como aceleradores de la computación y lo aplicamos al ámbito del álgebra lineal. En particular trabajamos con la biblioteca SLEPc para resolver problemas de cálculo de autovalores en matrices de gran dimensión, y para aplicar funciones de matrices en los cálculos de aplicaciones cientÃficas. SLEPc es una biblioteca paralela que se basa en el estándar MPI y está desarrollada con la premisa de ser escalable, esto es, de permitir resolver problemas más grandes al aumentar las unidades de procesado.
El problema lineal de autovalores, Ax = lambda x en su forma estándar, lo abordamos con el uso de técnicas iterativas, en concreto con métodos de Krylov, con los que calculamos una pequeña porción del espectro de autovalores. Este tipo de algoritmos se basa en generar un subespacio de tamaño reducido (m) en el que proyectar el problema de gran dimensión (n), siendo m << n. Una vez se ha proyectado el problema, se resuelve este mediante métodos directos, que nos proporcionan aproximaciones a los autovalores del problema inicial que querÃamos resolver. Las operaciones que se utilizan en la expansión del subespacio varÃan en función de si los autovalores deseados están en el exterior o en el interior del espectro. En caso de buscar autovalores en el exterior del espectro, la expansión se hace mediante multiplicaciones matriz-vector. Esta operación la realizamos en la GPU, bien mediante el uso de bibliotecas o mediante la creación de funciones que aprovechan la estructura de la matriz. En caso de autovalores en el interior del espectro, la expansión requiere resolver sistemas de ecuaciones lineales. En esta tesis implementamos varios algoritmos para la resolución de sistemas de ecuaciones lineales para el caso especÃfico de matrices con estructura tridiagonal a bloques, que se ejecutan en GPU.
En el cálculo de las funciones de matrices hemos de diferenciar entre la aplicación directa de una función sobre una matriz, f(A), y la aplicación de la acción de una función de matriz sobre un vector, f(A)b. El primer caso implica un cálculo denso que limita el tamaño del problema. El segundo permite trabajar con matrices dispersas grandes, y para resolverlo también hacemos uso de métodos de Krylov. La expansión del subespacio se hace mediante multiplicaciones matriz-vector, y hacemos uso de GPUs de la misma forma que al resolver autovalores. En este caso el problema proyectado comienza siendo de tamaño m, pero se incrementa en m en cada reinicio del método. La resolución del problema proyectado se hace aplicando una función de matriz de forma directa. Nosotros hemos implementado varios algoritmos para calcular las funciones de matrices raÃz cuadrada y exponencial, en las que el uso de GPUs permite acelerar el cálculo.One line of development followed in the field of supercomputing is the use of specific purpose processors to speed up certain types of computations. In this thesis we study the use of graphics processing units as computer accelerators and apply it to the field of linear algebra. In particular, we work with the SLEPc library to solve large scale eigenvalue problems, and to apply matrix functions in scientific applications. SLEPc is a parallel library based on the MPI standard and is developed with the premise of being scalable, i.e. to allow solving larger problems by increasing the processing units.
We address the linear eigenvalue problem, Ax = lambda x in its standard form, using iterative techniques, in particular with Krylov's methods, with which we calculate a small portion of the eigenvalue spectrum. This type of algorithms is based on generating a subspace of reduced size (m) in which to project the large dimension problem (n), being m << n. Once the problem has been projected, it is solved by direct methods, which provide us with approximations of the eigenvalues of the initial problem we wanted to solve. The operations used in the expansion of the subspace vary depending on whether the desired eigenvalues are from the exterior or from the interior of the spectrum. In the case of searching for exterior eigenvalues, the expansion is done by matrix-vector multiplications. We do this on the GPU, either by using libraries or by creating functions that take advantage of the structure of the matrix. In the case of eigenvalues from the interior of the spectrum, the expansion requires solving linear systems of equations. In this thesis we implemented several algorithms to solve linear systems of equations for the specific case of matrices with a block-tridiagonal structure, that are run on GPU.
In the computation of matrix functions we have to distinguish between the direct application of a matrix function, f(A), and the action of a matrix function on a vector, f(A)b. The first case involves a dense computation that limits the size of the problem. The second allows us to work with large sparse matrices, and to solve it we also make use of Krylov's methods. The expansion of subspace is done by matrix-vector multiplication, and we use GPUs in the same way as when solving eigenvalues. In this case the projected problem starts being of size m, but it is increased by m on each restart of the method. The solution of the projected problem is done by directly applying a matrix function. We have implemented several algorithms to compute the square root and the exponential matrix functions, in which the use of GPUs allows us to speed up the computation.Una lÃnia de desenvolupament seguida en el camp de la supercomputació és l'ús de processadors de propòsit especÃfic per a accelerar determinats tipus de cà lcul. En aquesta tesi estudiem l'ús de targetes grà fiques com a acceleradors de la computació i ho apliquem a l'à mbit de l'à lgebra lineal. En particular treballem amb la biblioteca SLEPc per a resoldre problemes de cà lcul d'autovalors en matrius de gran dimensió, i per a aplicar funcions de matrius en els cà lculs d'aplicacions cientÃfiques. SLEPc és una biblioteca paral·lela que es basa en l'està ndard MPI i està desenvolupada amb la premissa de ser escalable, açò és, de permetre resoldre problemes més grans en augmentar les unitats de processament.
El problema lineal d'autovalors, Ax = lambda x en la seua forma està ndard, ho abordem amb l'ús de tècniques iteratives, en concret amb mètodes de Krylov, amb els quals calculem una xicoteta porció de l'espectre d'autovalors. Aquest tipus d'algorismes es basa a generar un subespai de grandà ria reduïda (m) en el qual projectar el problema de gran dimensió (n), sent m << n. Una vegada s'ha projectat el problema, es resol aquest mitjançant mètodes directes, que ens proporcionen aproximacions als autovalors del problema inicial que volÃem resoldre. Les operacions que s'utilitzen en l'expansió del subespai varien en funció de si els autovalors desitjats estan en l'exterior o a l'interior de l'espectre. En cas de cercar autovalors en l'exterior de l'espectre, l'expansió es fa mitjançant multiplicacions matriu-vector. Aquesta operació la realitzem en la GPU, bé mitjançant l'ús de biblioteques o mitjançant la creació de funcions que aprofiten l'estructura de la matriu. En cas d'autovalors a l'interior de l'espectre, l'expansió requereix resoldre sistemes d'equacions lineals. En aquesta tesi implementem diversos algorismes per a la resolució de sistemes d'equacions lineals per al cas especÃfic de matrius amb estructura tridiagonal a blocs, que s'executen en GPU.
En el cà lcul de les funcions de matrius hem de diferenciar entre l'aplicació directa d'una funció sobre una matriu, f(A), i l'aplicació de l'acció d'una funció de matriu sobre un vector, f(A)b. El primer cas implica un cà lcul dens que limita la grandà ria del problema. El segon permet treballar amb matrius disperses grans, i per a resoldre-ho també fem ús de mètodes de Krylov. L'expansió del subespai es fa mitjançant multiplicacions matriu-vector, i fem ús de GPUs de la mateixa forma que en resoldre autovalors. En aquest cas el problema projectat comença sent de grandà ria m, però s'incrementa en m en cada reinici del mètode. La resolució del problema projectat es fa aplicant una funció de matriu de forma directa. Nosaltres hem implementat diversos algorismes per a calcular les funcions de matrius arrel quadrada i exponencial, en les quals l'ús de GPUs permet accelerar el cà lcul.Lamas Daviña, A. (2018). Dense and sparse parallel linear algebra algorithms on graphics processing units [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112425TESI
ChASE: Chebyshev Accelerated Subspace iteration Eigensolver for sequences of Hermitian eigenvalue problems
Solving dense Hermitian eigenproblems arranged in a sequence with direct
solvers fails to take advantage of those spectral properties which are
pertinent to the entire sequence, and not just to the single problem. When such
features take the form of correlations between the eigenvectors of consecutive
problems, as is the case in many real-world applications, the potential benefit
of exploiting them can be substantial. We present ChASE, a modern algorithm and
library based on subspace iteration with polynomial acceleration. Novel to
ChASE is the computation of the spectral estimates that enter in the filter and
an optimization of the polynomial degree which further reduces the necessary
FLOPs. ChASE is written in C++ using the modern software engineering concepts
which favor a simple integration in application codes and a straightforward
portability over heterogeneous platforms. When solving sequences of Hermitian
eigenproblems for a portion of their extremal spectrum, ChASE greatly benefits
from the sequence's spectral properties and outperforms direct solvers in many
scenarios. The library ships with two distinct parallelization schemes,
supports execution over distributed GPUs, and it is easily extensible to other
parallel computing architectures.Comment: 33 pages. Submitted to ACM TOM
Solving Large Dense Symmetric Eigenproblem on Hybrid Architectures
Dense symmetric eigenproblem is one of the most significant problems in the numerical linear algebra that arises in numerous research fields such as bioinformatics, computational chemistry, and meteorology. In the past years, the problems arising in these fields become bigger than ever resulting in growing demands in both computational power as well as the storage capacities. In such problems, the eigenproblem becomes the main computational bottleneck for which solution is required an extremely high computational power. Modern computing architectures that can meet these growing demands are those that combine the power of the traditional multi-core processors and the general-purpose GPUs and are called hybrid systems. These systems exhibit very high performance when the data fits into the GPU memory ; however, if the volume of the data exceeds the total GPU memory, i.e. the data is out-of-core from the GPU perspective, the performance rapidly decreases. This dissertation is focused on the development of the algorithms that solve dense symmetric eigenproblems on the hybrid GPU-based architectures. In particular, it aims at developing the eigensolvers that exhibit very high performance even if a problem is out- of-core for the GPU. The developed out-of-core eigensolvers are evaluated and compared on real problems that arise in the simulation of molecular motions. In such problems the data, usually too large to fit into the GPU memory, are stored in the main memory and copied to the GPU memory in pieces. That approach results in the performance drop due to a slow interconnection and a high memory latency. To overcome this problem an approach that applies blocking strategy and re- designs the existing eigensolvers, in order to decrease the volume of data transferred and the number of memory transfers, is presented. This approach designs and implements a set of the block- oriented, communication-avoiding BLAS routines that overlap the data transfers with the number of computations performed. Next, these routines are applied to speed-up the following eigensolvers: the solver based on the multi-stage reduction to a tridiagonal form, the Krylov subspace-based method, and the spectral divide-and-conquer method. Although the out-of-core BLAS routines significantly improve the performance of these three eigensolvers, a careful re-design is required in order to tackle the solution of the large eigenproblems on the hybrid CPU-GPU systems. In the out-of-core multi-stage reduction approach, the factor that mostly influences the performance is the band size of the obtained band matrix. On the other hand, the Krylov subspace- based method, although it is based on the memory- bound BLAS-2 operations, is the fastest method if only a small subset of the eigenpairs is required. Finally, the spectral divide-and- conquer algorithm, which exhibits significantly higher arithmetic cost than the other two eigensolvers, achieves extremely high performance since it can be performed completely in terms of the compute-bound BLAS-3 operations. Furthermore, its high arithmetic cost is further reduced by exploiting the special structure of a matrix. Finally, the results presented in the dissertation show that the three out-of-core eigen- solvers, for a set of the specific macromolecular problems, significantly overcome the multi-core variants and attain high flops rate even if data do not fit into the GPU memory. This proves that it is possible to solve large eigenproblems on modest computing systems equipped with a single GPU
Computational Physics on Graphics Processing Units
The use of graphics processing units for scientific computations is an
emerging strategy that can significantly speed up various different algorithms.
In this review, we discuss advances made in the field of computational physics,
focusing on classical molecular dynamics, and on quantum simulations for
electronic structure calculations using the density functional theory, wave
function techniques, and quantum field theory.Comment: Proceedings of the 11th International Conference, PARA 2012,
Helsinki, Finland, June 10-13, 201
A hierarchically blocked Jacobi SVD algorithm for single and multiple graphics processing units
We present a hierarchically blocked one-sided Jacobi algorithm for the
singular value decomposition (SVD), targeting both single and multiple graphics
processing units (GPUs). The blocking structure reflects the levels of GPU's
memory hierarchy. The algorithm may outperform MAGMA's dgesvd, while retaining
high relative accuracy. To this end, we developed a family of parallel pivot
strategies on GPU's shared address space, but applicable also to inter-GPU
communication. Unlike common hybrid approaches, our algorithm in a single GPU
setting needs a CPU for the controlling purposes only, while utilizing GPU's
resources to the fullest extent permitted by the hardware. When required by the
problem size, the algorithm, in principle, scales to an arbitrary number of GPU
nodes. The scalability is demonstrated by more than twofold speedup for
sufficiently large matrices on a Tesla S2050 system with four GPUs vs. a single
Fermi card.Comment: Accepted for publication in SIAM Journal on Scientific Computin
Roadmap on Electronic Structure Codes in the Exascale Era
Electronic structure calculations have been instrumental in providing many
important insights into a range of physical and chemical properties of various
molecular and solid-state systems. Their importance to various fields,
including materials science, chemical sciences, computational chemistry and
device physics, is underscored by the large fraction of available public
supercomputing resources devoted to these calculations. As we enter the
exascale era, exciting new opportunities to increase simulation numbers, sizes,
and accuracies present themselves. In order to realize these promises, the
community of electronic structure software developers will however first have
to tackle a number of challenges pertaining to the efficient use of new
architectures that will rely heavily on massive parallelism and hardware
accelerators. This roadmap provides a broad overview of the state-of-the-art in
electronic structure calculations and of the various new directions being
pursued by the community. It covers 14 electronic structure codes, presenting
their current status, their development priorities over the next five years,
and their plans towards tackling the challenges and leveraging the
opportunities presented by the advent of exascale computing.Comment: Submitted as a roadmap article to Modelling and Simulation in
Materials Science and Engineering; Address any correspondence to Vikram
Gavini ([email protected]) and Danny Perez ([email protected]