288 research outputs found

Efficient multicore-aware parallelization strategies for iterative stencil computations

Author: Bergen
Christen
Datta
Datta
Frigo
Hager
Kowarschik
Treibig
Wellein
Wittmann
Zeiser
Publication venue: 'Elsevier BV'
Publication date: 10/04/2010

Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel implementations for cache-based multicore architectures. Temporal cache blocking is a known advanced optimization technique, which can reduce the pressure on the memory bus significantly. We apply and refine this optimization for a recently presented temporal blocking strategy designed to explicitly utilize multicore characteristics. Especially for the case of Gauss-Seidel smoothers we show that simultaneous multi-threading (SMT) can yield substantial performance improvements for our optimized algorithm.

Comment: 15 pages, 10 figures
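
For readers unfamiliar with the two smoothers named in this abstract, the following is a minimal sketch of how a Jacobi sweep differs from a Gauss-Seidel sweep. It is not the authors' implementation: the 1D grid, the averaging stencil, and the function names `jacobi_sweep` and `gauss_seidel_sweep` are assumptions chosen for illustration, and no cache blocking or parallelization is shown.

```python
def jacobi_sweep(u):
    """One Jacobi sweep on a 1D grid: every update reads only old values."""
    new = u[:]  # separate output array; boundary values are copied unchanged
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def gauss_seidel_sweep(u):
    """One Gauss-Seidel sweep: updates in place, so u[i - 1] is already new."""
    u = u[:]  # work on a copy so the caller's list is not modified
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u


# Illustrative input: fixed boundary values of 1.0, interior initialized to 0.0.
grid = [1.0, 0.0, 0.0, 0.0, 1.0]
jacobi_result = jacobi_sweep(grid)
gauss_seidel_result = gauss_seidel_sweep(grid)
print(jacobi_result)
print(gauss_seidel_result)
```

The in-place update gives Gauss-Seidel a loop-carried dependency that Jacobi does not have, which is one reason the two smoothers call for different parallelization strategies.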

arXiv.org e-Print Archive

Crossref

Multicore-optimized wavefront diamond blocking for optimizing stencil updates

Author: Hager Georg
Keyes David
Ltaief Hatem
Malas Tareq
Stengel Holger
Wellein Gerhard
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 12/10/2014

The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.
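
The "bytes per lattice update" figure mentioned in this abstract can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not taken from the paper: it assumes a 7-point stencil in double precision, ideal spatial blocking (each grid value crosses the memory interface once per sweep), a write-allocate transfer for the output grid, and one extra coefficient array per stencil point in the variable-coefficient case. The 40 GB/s bandwidth is a hypothetical value.

```python
def bytes_per_lup(coefficient_arrays, word_bytes=8, write_allocate=True):
    """Estimate main-memory traffic per lattice update (LUP) without temporal blocking."""
    traffic = 2 * word_bytes  # load one input value, store one output value
    if write_allocate:
        traffic += word_bytes  # the output cache line is read before it is written
    traffic += coefficient_arrays * word_bytes  # one load per coefficient array
    return traffic


def bandwidth_bound_glups(bandwidth_gb_s, traffic_bytes):
    """Upper performance bound in giga lattice updates per second."""
    return bandwidth_gb_s / traffic_bytes


constant_traffic = bytes_per_lup(coefficient_arrays=0)  # constant coefficients
variable_traffic = bytes_per_lup(coefficient_arrays=7)  # assumed 7 coefficient arrays

assumed_bandwidth = 40.0  # GB/s, hypothetical memory bandwidth
print(constant_traffic, bandwidth_bound_glups(assumed_bandwidth, constant_traffic))
print(variable_traffic, bandwidth_bound_glups(assumed_bandwidth, variable_traffic))
```

Under these assumptions the variable-coefficient stencil moves more than three times as many bytes per update, so its bandwidth-bound performance is correspondingly lower. That gap is the kind of bandwidth-starved situation in which temporal blocking is expected to help most.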

arXiv.org e-Print Archive

CiteSeerX

A Multilevel Parallelization Framework for High-Order Stencil Computations

Author: A. Stathopoulos
A. Taflove
F. Desprez
G. Rivera
J. Dongarra
K. Datta
K.J. Barker
M. Bromley
M. Frigo
M. Snir
R. Bleck
R. Harlick
S. Kamil
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Crossref

A Survey on Hardware-aware and Heterogeneous Computing on Multicore Processors and Accelerators

Author: Buchty Rainer
Heuveline Vincent
Karl Wolfgang
Weiß Jan-Philipp
Publication venue: Karlsruher Institut für Technologie
Publication date: 01/01/2009

KITopen

Seismic Wave Propagation Simulations on Low-power and Performance-centric Manycores

Author: Abdelkhalek
Aochi
Aubry
Bianco
Castro
Christen
Datta
de Dinechin
Dumbser
Dupros
Dupros
Dursun
Emilio Francesquini
Fabrice Dupros
Francesquini
Francesquini
Göddeke
Hideo Aochi
Horowitz
Hähnel
Jean-François Méhaut
Komatitsch
Krueger
Lawson
Lysmer
Martin
Mercier
Michéa
Micikevicius
Morari
Márcio Castro
Pereira
Philippe O.A. Navaux
Pilla
Rajovic
Rashti
Reinders
Rivera
Saenger
Tang
Totoni
Varghese
Virieux
Zhang
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

The large processing requirements of seismic wave propagation simulations make High Performance Computing (HPC) architectures a natural choice for their execution. However, to keep both the current pace of performance improvements and the power consumption under a strict power budget, HPC systems must be more energy efficient than ever. As a response to this need, energy-efficient and low-power processors began to make their way into the market. In this paper we employ a novel low-power processor, the MPPA-256 manycore, to perform seismic wave propagation simulations. It has 256 cores connected by a NoC, no cache-coherence and only a limited amount of on-chip memory. We describe how its particular architectural characteristics influenced our solution for an energy-efficient implementation. As a counterpoint to the low-power MPPA-256 architecture, we employ Xeon Phi, a performance-centric manycore. Although both processors share some architectural similarities, the challenges to implement an efficient seismic wave propagation kernel on these platforms are very different. In this work we compare the performance and energy efficiency of our implementations for these processors to proven and optimized solutions for other hardware platforms such as general-purpose processors and a GPU. Our experimental results show that MPPA-256 has the best energy efficiency, consuming at least 77% less energy than the other evaluated platforms, whereas the performance of our solution for the Xeon Phi is on par with a state-of-the-art solution for GPUs.

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server