
785 research outputs found

Multicore-optimized wavefront diamond blocking for optimizing stencil updates

Author: Hager Georg
Keyes David
Ltaief Hatem
Malas Tareq
Stengel Holger
Wellein Gerhard
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 12/10/2014

The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.

arXiv.org e-Print Archive

CiteSeerX

Pervasive Parallel And Distributed Computing In A Liberal Arts College Curriculum

Author: Danner Andrew
Newhall Tia
Webb Kevin
Publication venue: 'Transformative Works and Cultures'
Publication date: 01/07/2017

We present a model for incorporating parallel and distributed computing (PDC) throughout an undergraduate CS curriculum. Our curriculum is designed to introduce students early to parallel and distributed computing topics and to expose students to these topics repeatedly in the context of a wide variety of CS courses. The key to our approach is the development of a required intermediate-level course that serves as an introduction to computer systems and parallel computing. It serves as a requirement for every CS major and minor and is a prerequisite to upper-level courses that expand on parallel and distributed computing topics in different contexts. With the addition of this new course, we are able to easily make room in upper-level courses to add and expand parallel and distributed computing topics. The goal of our curricular design is to ensure that every graduating CS major has exposure to parallel and distributed computing, with both a breadth and depth of coverage. Our curriculum is particularly designed for the constraints of a small liberal arts college; however, many of its ideas and much of its design are applicable to any undergraduate CS curriculum.

Works

CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

Author: de Supinski Bronis R.
Feng Wu-chun
Rountree Barry
Scogland Thomas R. W.
Publication date: 01/01/2012

Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP’s best features, its flexibility: typical OpenMP applications can run on any number of CPUs. GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over only using CPUs or one GPU while increasing the overall flexibility of Accelerated OpenMP.

Computer Science Technical Reports @Virginia Tech

Improved parallelization techniques for the density matrix renormalization group

Author: Alvarez
Chan
Chapman
D.J. García
Davidson
Foulkes
Gustafson
Hager
Hallberg
Hubbard
Julián Rincón
K. Hallberg
Kanamori
Korch
Kurashige
Lanczos
Murray
Noack
Noack
Schollwöck
Snir
White
White
Wilson
Yamada
Publication venue: 'Elsevier BV'
Publication date: 17/04/2010

A distributed-memory parallelization strategy for the density matrix renormalization group is proposed for cases where correlation functions are required. This new strategy has substantial improvements with respect to previous works. A scalability analysis shows an overall serial fraction of 9.4% and an efficiency of around 60% considering up to eight nodes. Sources of possible parallel slowdown are pointed out and solutions to circumvent these issues are brought forward in order to achieve a better performance. Comment: 8 pages, 4 figures; version published in Computer Physics Communications.

arXiv.org e-Print Archive

Crossref

Response-time analysis of DAG tasks supporting heterogeneous computing

Author: Saifullah A.
Serrano M. A.
Tegra IDIA
Vargas R.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/06/2018

Hardware platforms are evolving towards parallel and heterogeneous architectures to overcome the increasing necessity of more performance in the real-time domain. Parallel programming models are fundamental to exploit the performance capabilities of these architectures. This paper proposes a novel response time analysis (RTA) for verifying the schedulability of DAG tasks supporting heterogeneous computing. It analyzes the impact of executing part of the DAG in the accelerator device. As a result, the response time upper bound of the system is more precise than the one provided by currently existing RTA targeting homogeneous architectures. This work is supported by the Spanish Ministry of Science and Innovation under contract TIN2015-65316-P. Peer Reviewed. Postprint (published version).

Crossref

UPCommons. Portal del coneixement obert de la UPC

A parallel algorithm for Hamiltonian matrix construction in electron-molecule collision calculations: MPI-SCATCI

Author: Al-Refaie Ahmed F.
Tennyson Jonathan
Publication venue: 'Elsevier BV'
Publication date: 09/09/2017

Construction and diagonalization of the Hamiltonian matrix is the rate-limiting step in most low-energy electron-molecule collision calculations. Tennyson (J Phys B, 29 (1996) 1817) implemented a novel algorithm for Hamiltonian construction which took advantage of the structure of the wavefunction in such calculations. This algorithm is re-engineered to make use of modern computer architectures and the use of appropriate diagonalizers is considered. Test calculations demonstrate that significant speed-ups can be gained using multiple CPUs. This opens the way to calculations which consider higher collision energies, larger molecules and/or more target states. The methodology, which is implemented as part of the UK molecular R-matrix codes (UKRMol and UKRMol+), can also be used for studies of bound molecular Rydberg states, photoionisation and positron-molecule collisions. Comment: Write-up of a computer program, MPI-SCATCI; Computer Physics Communications, in press.

arXiv.org e-Print Archive

UCL Discovery