Devito: Towards a generic Finite Difference DSL using Symbolic Python
Domain specific languages (DSL) have been used in a variety of fields to
express complex scientific problems in a concise manner and provide automated
performance optimization for a range of computational architectures. As such
DSLs provide a powerful mechanism to speed up scientific Python computation
that goes beyond traditional vectorization and pre-compilation approaches,
while allowing domain scientists to build applications within the comforts of
the Python software ecosystem. In this paper we present Devito, a new finite
difference DSL that provides optimized stencil computation from high-level
problem specifications based on symbolic Python expressions. We demonstrate
Devito's symbolic API and performance advantages over traditional Python
acceleration methods before highlighting its use in the scientific context of
seismic inversion problems.
Comment: pyHPC 2016 conference submission.
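To make concrete what such a DSL automates, the following is a minimal NumPy sketch (not Devito itself; the function name and coefficient are illustrative) of the kind of low-level stencil kernel a finite difference DSL generates from a symbolic specification of the 2-D heat equation:

```python
import numpy as np

def diffusion_step(u, c=0.1):
    """One explicit time step of u_t = c * (u_xx + u_yy) on a unit-spaced
    grid with fixed boundaries -- the hand-written loop nest that a
    symbolic DSL spares the user from writing and tuning."""
    out = u.copy()
    out[1:-1, 1:-1] = u[1:-1, 1:-1] + c * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        - 4.0 * u[1:-1, 1:-1]
    )
    return out
```

A DSL user instead states the equation symbolically and lets the compiler emit (and optimize) the equivalent of this kernel for the target architecture.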
Performance Improvement in Kernels by Guiding Compiler Auto-Vectorization Heuristics
Vectorization support in hardware continues to expand and grow even as we remain on superscalar architectures. Unfortunately, compilers are not always able to generate optimal code for the hardware; detecting and generating vectorized code is extremely complex. Programmers can use a number of tools to aid in development and tuning, but most of these tools require expert or domain-specific knowledge to use. In this work we aim to provide techniques for determining the best way to optimize certain codes, with an end goal of guiding the compiler into generating optimized code without requiring expert knowledge from the developer. Initially, we study how to combine vectorization reports with iterative compilation and code generation, and we summarize our insights and patterns on how the compiler vectorizes code. Our utilities for iterative compilation and code generation can further be used by non-experts in the generation and analysis of programs. Finally, we leverage the obtained knowledge to design a Support Vector Machine classifier that predicts the speedup of a program given a sequence of optimizations, with 82% of the predictions accurate to within 15% in either direction.
Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes per lattice update case of variable coefficients. Our thread
groups concept provides a controllable trade-off between concurrency and memory
usage, shifting the pressure between the memory interface and the CPU. We
present performance results on a contemporary Intel processor.
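The principle behind temporal blocking can be shown with a deliberately simplified 1-D sketch. This is overlapped temporal tiling with a redundantly recomputed halo, not the paper's wavefront diamond scheme, and all names are illustrative; the point is that all T sweeps over a tile happen while the tile is cache-resident, cutting main-memory traffic:

```python
import numpy as np

def step(u):
    """One Jacobi sweep of a 3-point averaging stencil; endpoints held fixed."""
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

def naive(u, T):
    """Baseline: T full sweeps, streaming the whole array from memory each time."""
    for _ in range(T):
        u = step(u)
    return u

def temporal_blocked(u, T, tile=8):
    """Process the array tile by tile, carrying a halo of width T so that
    each tile's interior is exact after T in-cache sweeps."""
    n = len(u)
    out = np.empty_like(u)
    for start in range(0, n, tile):
        stop = min(n, start + tile)
        lo, hi = max(0, start - T), min(n, stop + T)
        block = u[lo:hi].copy()
        for _ in range(T):
            block = step(block)       # all T updates reuse the cached block
        out[start:stop] = block[start - lo : stop - lo]
    return out
```

The halo width T covers the dependence cone of the 3-point stencil over T steps, so the blocked result matches the naive one exactly; diamond tiling and wavefronts refine this idea to avoid the redundant halo work.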
Stencil Computation with Vector Outer Product
Matrix computation units have been equipped in current architectures to
accelerate AI and high performance computing applications. The matrix
multiplication and vector outer product are two basic instruction types. The
latter one is lighter since the inputs are vectors. Thus it provides more
opportunities to develop flexible algorithms for problems other than dense
linear algebra computing and more possibilities to optimize the implementation.
Stencil computations represent a common class of nested loops in scientific and
engineering applications. This paper proposes a novel stencil algorithm using
vector outer products. Unlike previous work, the new algorithm arises from the
stencil definition in the scatter mode and is initially expressed with formulas
of vector outer products. The implementation incorporates a set of
optimizations to improve the memory reference pattern, execution pipeline and
data reuse by considering various algorithmic options and the data sharing
between input vectors. Evaluation on a simulator shows that our design achieves
a substantial speedup compared with a vectorized stencil algorithm.
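The algebraic identity underlying outer-product stencil formulations can be sketched in NumPy. This shows only the classical rank-1 (separable) decomposition of a coefficient matrix C = outer(a, b), where the 2-D stencil factors into two 1-D passes; it is not the paper's scatter-mode, instruction-level algorithm, and all names are illustrative:

```python
import numpy as np

def stencil_direct(u, C):
    """Apply a (2r+1) x (2r+1) stencil with coefficient matrix C directly
    (interior points only)."""
    r = C.shape[0] // 2
    m, n = u.shape
    out = np.zeros_like(u)
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            out[r:-r, r:-r] += C[di + r, dj + r] * u[r+di : m-r+di, r+dj : n-r+dj]
    return out

def stencil_separable(u, a, b):
    """Same stencil when C = np.outer(a, b): one 1-D pass with b along
    rows, then one with a along columns -- the rank-1 structure that
    vector outer-product instructions exploit."""
    r = len(a) // 2
    m, n = u.shape
    tmp = np.zeros_like(u)
    out = np.zeros_like(u)
    for dj in range(-r, r + 1):
        tmp[:, r:-r] += b[dj + r] * u[:, r+dj : n-r+dj]
    for di in range(-r, r + 1):
        out[r:-r, :] += a[di + r] * tmp[r+di : m-r+di, :]
    return out
```

For a rank-1 coefficient matrix the two routines agree on the interior, while the factored form reads each input element in long unit-stride vectors, which is what enables flexible data reuse between the input vectors of the outer products.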
Algebraic Tiling
In this paper, we present ongoing work whose aim is to propose a new loop tiling technique where tiles are characterized by their volumes (the number of embedded iterations) instead of their sizes (the lengths of their edges). Tiles of quasi-equal volumes are generated dynamically while the tiled loops are running, whatever the original loop bounds, which may be constant or depend linearly on surrounding loop iterators. The adopted strategy is to successively and hierarchically slice the iteration domain into parts of quasi-equal volumes, from the outermost to the innermost loop dimensions. Since the number of such slices can be chosen exactly, quasi-perfect load balancing is reached by choosing, for each parallel loop, a number of slices equal to the number of parallel threads, or to a multiple of this number. Moreover, the approach avoids partial tiles by construction, thus yielding a perfect covering of the iteration domain and minimizing the loop control cost. Finally, algebraic tiling makes dynamic scheduling of the parallel threads largely unnecessary for the handled parallel tiled loops.
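A one-level sketch of the slicing step can illustrate the idea. This simplified example (names are illustrative, and the real technique slices hierarchically across several dimensions) splits the outer loop of a triangular loop nest `for i in range(N): for j in range(i)` into P contiguous slices of quasi-equal volume rather than equal width:

```python
def algebraic_slices(N, P):
    """Slice the triangular domain {(i, j) : 0 <= j < i < N} along i into
    P contiguous slices with quasi-equal iteration counts.  Equal-width
    slices would be badly imbalanced here, since late i values do more work."""
    total = N * (N - 1) // 2          # volume of the whole iteration domain
    bounds, done, i = [0], 0, 0
    for p in range(1, P + 1):
        target = round(p * total / P)  # cumulative volume slice p should reach
        while done < target and i < N:
            done += i                  # inner loop of iteration i runs i times
            i += 1
        bounds.append(i)
    return bounds                      # slice p covers i in [bounds[p], bounds[p+1])
```

Each slice deviates from the ideal volume total/P by at most one outer iteration's trip count, so assigning one slice per thread gives quasi-perfect load balance without any partial tiles or dynamic scheduling.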