Evaluation of Directive-Based GPU Programming Models on a Block Eigensolver with Consideration of Large Sparse Matrices

Author: A Dziekonski
AV Knyazev
C Yang
G Ortega
JW Choi
M Knap
M Shao
P Maris
X Yang
Y Wang
Publication venue: eScholarship, University of California
Publication date: 01/01/2020

Achieving high performance and performance portability for large-scale scientific applications is a major challenge on heterogeneous computing systems such as many-core CPUs and accelerators like GPUs. In this work, we implement a widely used block eigensolver, Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG), using two popular directive-based programming models (OpenMP and OpenACC) for GPU-accelerated systems. Our work differs from existing work in that it adopts a holistic approach that optimizes the full solver performance rather than narrowing the problem into small kernels (e.g., SpMM, SpMV). Our LOBPCG GPU implementation achieves a 2.8×–4.3× speedup over an optimized CPU implementation when tested with four different input matrices. The evaluated configuration compared one Skylake CPU to one Skylake CPU and one NVIDIA V100 GPU. Our OpenMP and OpenACC LOBPCG GPU implementations gave nearly identical performance. We also consider how to create an efficient LOBPCG solver that can solve problems larger than GPU memory capacity. To this end, we create microbenchmarks representing the two dominant kernels (inner product and SpMM kernel) in LOBPCG and then evaluate performance when using two different programming approaches: tiling the kernels, and using Unified Memory with the original kernels. Our tiled SpMM implementation achieves a 2.9× and 48.2× speedup over the Unified Memory implementation on supercomputers with PCIe Gen3 and NVLink 2.0 CPU to GPU interconnects, respectively.
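
The tiling approach mentioned in the abstract above can be illustrated with a minimal CPU-only sketch. It is not the authors' implementation: it assumes a CSR-format sparse matrix, a row-wise tiling scheme, and hypothetical function names (`spmm_csr`, `tiled_spmm`). The point is only that a sparse matrix times dense block product (SpMM) can be computed one row tile at a time, so that only one tile needs to be resident in limited memory at once.

```python
def spmm_csr(indptr, indices, data, X, row_start, row_end):
    """Multiply rows [row_start, row_end) of a CSR matrix by the dense block X."""
    ncols = len(X[0])
    out = []
    for i in range(row_start, row_end):
        acc = [0.0] * ncols
        # Walk the nonzeros of row i and accumulate scaled rows of X.
        for k in range(indptr[i], indptr[i + 1]):
            a, xrow = data[k], X[indices[k]]
            for j in range(ncols):
                acc[j] += a * xrow[j]
        out.append(acc)
    return out


def tiled_spmm(indptr, indices, data, X, tile_rows):
    """Compute Y = A @ X by processing A in tiles of at most tile_rows rows."""
    nrows = len(indptr) - 1
    Y = []
    for start in range(0, nrows, tile_rows):
        end = min(start + tile_rows, nrows)
        # Assumption: on a GPU, this is where one tile would be staged to the
        # device, multiplied, and its result block copied back.
        Y.extend(spmm_csr(indptr, indices, data, X, start, end))
    return Y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]] stored in CSR form
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

Y = tiled_spmm(indptr, indices, data, X, tile_rows=2)
print(Y)
```

Because each output row depends only on one row of the sparse matrix, the tile size changes how much data is resident at a time but not the result.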

Crossref

eScholarship - University of California

An efficient MPI/OpenMP parallelization of the Hartree-Fock method for the second generation of Intel Xeon Phi processor

Author: Alexeev Yuri
D'mello Michael
Gordon Mark S.
Keipert Kristopher
Mironov Vladimir
Moskovsky Alexander
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 14/08/2017

Modern OpenMP threading techniques are used to convert the MPI-only Hartree-Fock code in the GAMESS program to a hybrid MPI/OpenMP algorithm. Two separate implementations are considered, which differ in whether key data structures (the density and Fock matrices) are shared or replicated among threads. All implementations are benchmarked on a supercomputer of 3,000 Intel Xeon Phi processors. With 64 cores per processor, scaling numbers are reported on up to 192,000 cores. The hybrid MPI/OpenMP implementation reduces the memory footprint by approximately 200 times compared to the legacy code. The MPI/OpenMP code was shown to run up to six times faster than the original for a range of molecular system sizes. Comment: SC17 conference paper, 12 pages, 7 figures
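
The trade-off between sharing and replicating a matrix among threads can be sketched in a few lines. This is a toy illustration, not the GAMESS code: it uses Python threads rather than OpenMP, the function names are hypothetical, and the `tasks` list is a made-up stand-in for batches of matrix updates. The replicated variant gives each thread a private copy and sums the copies at the end; the shared variant keeps a single matrix and serializes updates with a lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def accumulate_replicated(n, tasks, num_threads):
    """Each thread fills a private n x n matrix; the copies are summed afterwards."""
    def worker(chunk):
        local = [[0.0] * n for _ in range(n)]
        for task in chunk:
            for i, j, v in task:
                local[i][j] += v
        return local

    chunks = [tasks[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(worker, chunks))

    # Reduction step: memory use grows with num_threads copies of the matrix.
    total = [[0.0] * n for _ in range(n)]
    for part in partials:
        for i in range(n):
            for j in range(n):
                total[i][j] += part[i][j]
    return total


def accumulate_shared(n, tasks, num_threads):
    """All threads update one n x n matrix, guarded by a lock."""
    shared = [[0.0] * n for _ in range(n)]
    lock = threading.Lock()

    def worker(chunk):
        for task in chunk:
            for i, j, v in task:
                with lock:  # one copy in memory, at the cost of synchronization
                    shared[i][j] += v

    chunks = [tasks[t::num_threads] for t in range(num_threads)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(worker, chunks))
    return shared


# Hypothetical workload: 12 tasks, each adding 1.0 to one matrix entry.
n = 3
tasks = [[(t % n, (t + 1) % n, 1.0)] for t in range(12)]

replicated = accumulate_replicated(n, tasks, num_threads=4)
shared = accumulate_shared(n, tasks, num_threads=4)
print(replicated == shared)
```

Both variants produce the same matrix here; they differ in memory footprint (one copy versus one per thread) and in how much synchronization the updates require.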

arXiv.org e-Print Archive

Crossref

CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

Author: de Supinski Bronis R.
Feng Wu-chun
Rountree Barry
Scogland Thomas R. W.
Publication venue
Publication date: 01/01/2012

Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency, and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP’s best features, its flexibility: typical OpenMP applications can run on any number of CPUs, whereas GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over only using CPUs or one GPU while increasing the overall flexibility of Accelerated OpenMP.

Computer Science Technical Reports @Virginia Tech

Parallelization Strategies for Density Matrix Renormalization Group Algorithms on Shared-Memory Systems

Author: E. Jeckelmann
Fehske
G. Hager
G. Wellein
Goedecker
Gutzwiller
H. Fehske
Holstein
Hubbard
Jeckelmann
Kanamori
Nishimoto
Wellein
White
Publication venue: 'Elsevier BV'
Publication date: 20/05/2003

Shared-memory parallelization (SMP) strategies for density matrix renormalization group (DMRG) algorithms enable the treatment of complex systems in solid state physics. We present two different approaches by which parallelization of the standard DMRG algorithm can be accomplished in an efficient way. The methods are illustrated with DMRG calculations of the two-dimensional Hubbard model and the one-dimensional Holstein-Hubbard model on contemporary SMP architectures. The parallelized code shows good scalability up to at least eight processors and allows us to solve problems which exceed the capability of sequential DMRG calculations. Comment: 18 pages, 9 figures

arXiv.org e-Print Archive

Crossref