Efficient multicore-aware parallelization strategies for iterative stencil computations

Authors: Bergen; Christen; Datta; Datta; Frigo; Hager; Kowarschik; Treibig; Wellein; Wittmann; Zeiser
Publication venue: 'Elsevier BV'
Publication date: 10/04/2010

Stencil computations consume a major part of runtime in many scientific simulation codes. As prototypes for this class of algorithms we consider the iterative Jacobi and Gauss-Seidel smoothers and aim at highly efficient parallel implementations for cache-based multicore architectures. Temporal cache blocking is a known advanced optimization technique, which can reduce the pressure on the memory bus significantly. We apply and refine this optimization for a recently presented temporal blocking strategy designed to explicitly utilize multicore characteristics. Especially for the case of Gauss-Seidel smoothers we show that simultaneous multi-threading (SMT) can yield substantial performance improvements for our optimized algorithm. Comment: 15 pages, 10 figures
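
The two smoothers named in the abstract above can be illustrated with a minimal sketch. It assumes a 2D grid and a five-point averaging stencil, neither of which is stated in the abstract, and it shows only the plain sweeps, not the paper's temporal blocking or any parallelization. The function names `jacobi_sweep` and `gauss_seidel_sweep` are hypothetical.

```python
def jacobi_sweep(grid):
    """One Jacobi sweep (assumed 2D five-point stencil).

    Every interior point becomes the mean of its four neighbours. All reads
    come from the old grid and all writes go to a fresh copy.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # boundary values are carried over unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def gauss_seidel_sweep(grid):
    """One Gauss-Seidel sweep with the same assumed stencil.

    The grid is updated in place, so points visited later in the sweep
    already see the new values of points visited earlier.
    """
    rows, cols = len(grid), len(grid[0])
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            grid[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                 + grid[i][j - 1] + grid[i][j + 1])
    return grid


def make_grid():
    """3 x 4 grid: zero everywhere except one boundary value on the left."""
    g = [[0.0] * 4 for _ in range(3)]
    g[1][0] = 4.0
    return g


jacobi_result = jacobi_sweep(make_grid())
gauss_seidel_result = gauss_seidel_sweep(make_grid())
print(jacobi_result[1], gauss_seidel_result[1])
```

After one sweep the boundary value has reached only the first interior point under Jacobi, while under Gauss-Seidel it has already influenced the second one. That in-place dependence is what makes Gauss-Seidel harder to parallelize than Jacobi.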

arXiv.org e-Print Archive

Crossref

Multicore-optimized wavefront diamond blocking for optimizing stencil updates

Authors: Hager Georg; Keyes David; Ltaief Hatem; Malas Tareq; Stengel Holger; Wellein Gerhard
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 12/10/2014

The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.

arXiv.org e-Print Archive

CiteSeerX

Accelerated Modeling of Near and Far-Field Diffraction for Coronagraphic Optical Systems

Authors: Abdellah; Akeret; Cooke; Cooley; Douglas; Douglas; Fangohr; Frigo; Greenfield; Greenfield; Hirst; Jones; Kluyver; Lawrence; Lumbres; Macintosh; Marois; Mendillo; Morgan; Noecker; Pavlyk; Perrin; Shimobaba; Soummer; Steinbach; Stone; Yamamoto
Publication venue: 'SPIE-Intl Soc Optical Eng'
Publication date: 17/06/2018

Accurately predicting the performance of coronagraphs and tolerancing optical surfaces for high-contrast imaging requires a detailed accounting of diffraction effects. Unlike simple Fraunhofer diffraction modeling, near and far-field diffraction effects, such as the Talbot effect, are captured by plane-to-plane propagation using Fresnel and angular spectrum propagation. This approach requires a sequence of computationally intensive Fourier transforms and quadratic phase functions, which limits the design and aberration-sensitivity parameter space that can be explored at high fidelity in the course of coronagraph design. This study presents the results of optimizing the multi-surface propagation module of the open-source Physical Optics Propagation in PYthon (POPPY) package. This optimization was performed by implementing and benchmarking Fourier transforms and array operations on graphics processing units, as well as optimizing multithreaded numerical calculations using the NumExpr Python library where appropriate, to speed the end-to-end simulation of observatory and coronagraph optical systems. Using realistic systems, this study demonstrates a greater than five-fold decrease in wall-clock runtime over POPPY's previous implementation and describes opportunities for further improvements in diffraction modeling performance. Comment: Presented at SPIE ASTI 2018, Austin, Texas. 11 pages, 6 figures

arXiv.org e-Print Archive

Crossref

Data-parallel intra decoding for block-based image and video coding on massively parallel architectures

Authors: De Cock Jan; Hollemeersch Charles; Lambert Peter; Pieters Bart; Van de Walle Rik
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012

Ghent University Academic Bibliography

Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines

Authors: Cooper Lee; Kong Jun; Kurc Tahsin; Pan Tony; Saltz Joel; Teodoro George
Publication date: 14/09/2012

In this paper, we address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU-accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x with respect to single-core CPU executions for morphological reconstruction and Euclidean distance transform, respectively. Comment: 37 pages, 16 figures
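
The propagation pattern described in the abstract above can be sketched for one of the two operations it names, morphological reconstruction. This is a single-threaded illustration using one FIFO queue, not the paper's multi-level queue or its CPU-GPU tiling. It assumes grayscale reconstruction by dilation with 4-connectivity, where the propagation condition is that a neighbour's value is below the minimum of the current pixel's value and the neighbour's mask value. The function name `morphological_reconstruction` is hypothetical.

```python
from collections import deque


def morphological_reconstruction(marker, mask):
    """Queue-driven grayscale reconstruction by dilation (assumed variant).

    A pixel in the wavefront raises each 4-connected neighbour to
    min(own value, neighbour's mask value) when that is an increase;
    a neighbour that was raised joins the wavefront.
    """
    rows, cols = len(marker), len(marker[0])
    # Clip the marker so that it never exceeds the mask.
    out = [[min(m, k) for m, k in zip(marker_row, mask_row)]
           for marker_row, mask_row in zip(marker, mask)]
    # Initial wavefront: every pixel (simple, though not the tightest choice).
    wavefront = deque((i, j) for i in range(rows) for j in range(cols))
    while wavefront:
        i, j = wavefront.popleft()
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < rows and 0 <= nj < cols:
                candidate = min(out[i][j], mask[ni][nj])
                if out[ni][nj] < candidate:      # propagation condition
                    out[ni][nj] = candidate
                    wavefront.append((ni, nj))   # receiver joins the wavefront
    return out


mask = [[5, 5, 0, 3],
        [5, 5, 0, 3]]
marker = [[5, 0, 0, 0],
          [0, 0, 0, 0]]
reconstructed = morphological_reconstruction(marker, mask)
print(reconstructed)
```

In this example the seed fills the connected region on the left, while the zero-valued mask column stops the wave from reaching the region on the right. Which pixels are visited, and how often, depends on the data, which is the irregularity the abstract refers to.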

arXiv.org e-Print Archive

CiteSeerX