Devito: Towards a generic Finite Difference DSL using Symbolic Python
Domain specific languages (DSL) have been used in a variety of fields to
express complex scientific problems in a concise manner and provide automated
performance optimization for a range of computational architectures. As such
DSLs provide a powerful mechanism to speed up scientific Python computation
that goes beyond traditional vectorization and pre-compilation approaches,
while allowing domain scientists to build applications within the comforts of
the Python software ecosystem. In this paper we present Devito, a new finite
difference DSL that provides optimized stencil computation from high-level
problem specifications based on symbolic Python expressions. We demonstrate
Devito's symbolic API and performance advantages over traditional Python
acceleration methods before highlighting its use in the scientific context of
seismic inversion problems.
Comment: pyHPC 2016 conference submission.
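To make concrete what such a DSL automates, the following is a minimal NumPy sketch (not Devito itself; the function name and coefficient are illustrative) of the kind of low-level stencil kernel a finite difference DSL generates from a symbolic specification of the 2-D heat equation:

```python
import numpy as np

def diffusion_step(u, c=0.1):
    """One explicit time step of u_t = c * (u_xx + u_yy) on a unit-spaced
    grid with fixed boundaries -- the hand-written loop nest that a
    symbolic DSL spares the user from writing and tuning."""
    out = u.copy()
    out[1:-1, 1:-1] = u[1:-1, 1:-1] + c * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        - 4.0 * u[1:-1, 1:-1]
    )
    return out
```

A DSL user instead states the equation symbolically and lets the compiler emit (and optimize) the equivalent of this kernel for the target architecture.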
Performance Improvement in Kernels by Guiding Compiler Auto-Vectorization Heuristics
Vectorization support in hardware continues to expand and grow even as we remain on superscalar architectures. Unfortunately, compilers are not always able to generate optimal code for the hardware; detecting and generating vectorized code is extremely complex. Programmers can use a number of tools to aid in development and tuning, but most of these tools require expert or domain-specific knowledge to use. In this work we aim to provide techniques for determining the best way to optimize certain codes, with an end goal of guiding the compiler into generating optimized code without requiring expert knowledge from the developer. Initially, we study how to combine vectorization reports with iterative compilation and code generation, and we summarize our insights and patterns on how the compiler vectorizes code. Our utilities for iterative compilation and code generation can further be used by non-experts in the generation and analysis of programs. Finally, we leverage the obtained knowledge to design a Support Vector Machine classifier that predicts the speedup of a program given a sequence of optimizations, with 82% of the predictions accurate to within 15% in either direction.
Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes per lattice update case of variable coefficients. Our thread
groups concept provides a controllable trade-off between concurrency and memory
usage, shifting the pressure between the memory interface and the CPU. We
present performance results on a contemporary Intel processor.
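The principle behind temporal blocking can be shown with a deliberately simplified 1-D sketch. This is overlapped temporal tiling with a redundantly recomputed halo, not the paper's wavefront diamond scheme, and all names are illustrative; the point is that all T sweeps over a tile happen while the tile is cache-resident, cutting main-memory traffic:

```python
import numpy as np

def step(u):
    """One Jacobi sweep of a 3-point averaging stencil; endpoints held fixed."""
    v = u.copy()
    v[1:-1] = (u[:-2] + u[1:-1] + u[2:]) / 3.0
    return v

def naive(u, T):
    """Baseline: T full sweeps, streaming the whole array from memory each time."""
    for _ in range(T):
        u = step(u)
    return u

def temporal_blocked(u, T, tile=8):
    """Process the array tile by tile, carrying a halo of width T so that
    each tile's interior is exact after T in-cache sweeps."""
    n = len(u)
    out = np.empty_like(u)
    for start in range(0, n, tile):
        stop = min(n, start + tile)
        lo, hi = max(0, start - T), min(n, stop + T)
        block = u[lo:hi].copy()
        for _ in range(T):
            block = step(block)       # all T updates reuse the cached block
        out[start:stop] = block[start - lo : stop - lo]
    return out
```

The halo width T covers the dependence cone of the 3-point stencil over T steps, so the blocked result matches the naive one exactly; diamond tiling and wavefronts refine this idea to avoid the redundant halo work.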
Stencil Computation with Vector Outer Product
Matrix computation units have been equipped in current architectures to
accelerate AI and high performance computing applications. The matrix
multiplication and vector outer product are two basic instruction types. The
latter one is lighter since the inputs are vectors. Thus it provides more
opportunities to develop flexible algorithms for problems other than dense
linear algebra computing and more possibilities to optimize the implementation.
Stencil computations represent a common class of nested loops in scientific and
engineering applications. This paper proposes a novel stencil algorithm using
vector outer products. Unlike previous work, the new algorithm arises from the
stencil definition in the scatter mode and is initially expressed with formulas
of vector outer products. The implementation incorporates a set of
optimizations to improve the memory reference pattern, execution pipeline and
data reuse by considering various algorithmic options and the data sharing
between input vectors. Evaluation on a simulator shows that our design achieves
a substantial speedup compared with a vectorized stencil algorithm.
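The algebraic identity underlying outer-product stencil formulations can be sketched in NumPy. This shows only the classical rank-1 (separable) decomposition of a coefficient matrix C = outer(a, b), where the 2-D stencil factors into two 1-D passes; it is not the paper's scatter-mode, instruction-level algorithm, and all names are illustrative:

```python
import numpy as np

def stencil_direct(u, C):
    """Apply a (2r+1) x (2r+1) stencil with coefficient matrix C directly
    (interior points only)."""
    r = C.shape[0] // 2
    m, n = u.shape
    out = np.zeros_like(u)
    for di in range(-r, r + 1):
        for dj in range(-r, r + 1):
            out[r:-r, r:-r] += C[di + r, dj + r] * u[r+di : m-r+di, r+dj : n-r+dj]
    return out

def stencil_separable(u, a, b):
    """Same stencil when C = np.outer(a, b): one 1-D pass with b along
    rows, then one with a along columns -- the rank-1 structure that
    vector outer-product instructions exploit."""
    r = len(a) // 2
    m, n = u.shape
    tmp = np.zeros_like(u)
    out = np.zeros_like(u)
    for dj in range(-r, r + 1):
        tmp[:, r:-r] += b[dj + r] * u[:, r+dj : n-r+dj]
    for di in range(-r, r + 1):
        out[r:-r, :] += a[di + r] * tmp[r+di : m-r+di, :]
    return out
```

For a rank-1 coefficient matrix the two routines agree on the interior, while the factored form reads each input element in long unit-stride vectors, which is what enables flexible data reuse between the input vectors of the outer products.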
Algebraic Tiling
In this paper, we present ongoing work whose aim is to propose a new loop tiling technique where tiles are characterized by their volumes (the number of embedded iterations) instead of their sizes (the lengths of their edges). Tiles of quasi-equal volumes are generated dynamically while the tiled loops are running, whatever the original loop bounds, which may be constant or depend linearly on surrounding loop iterators. The adopted strategy is to successively and hierarchically slice the iteration domain into parts of quasi-equal volumes, from the outermost to the innermost loop dimensions. Since the number of such slices can be chosen exactly, quasi-perfect load balancing is reached by choosing, for each parallel loop, a number of slices equal to the number of parallel threads, or to a multiple of this number. Moreover, the approach avoids partial tiles by construction, thus yielding a perfect covering of the iteration domain and minimizing the loop control cost. Finally, algebraic tiling makes dynamic scheduling of the parallel threads largely unnecessary for the handled parallel tiled loops.
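A one-level sketch of the slicing step can illustrate the idea. This simplified example (names are illustrative, and the real technique slices hierarchically across several dimensions) splits the outer loop of a triangular loop nest `for i in range(N): for j in range(i)` into P contiguous slices of quasi-equal volume rather than equal width:

```python
def algebraic_slices(N, P):
    """Slice the triangular domain {(i, j) : 0 <= j < i < N} along i into
    P contiguous slices with quasi-equal iteration counts.  Equal-width
    slices would be badly imbalanced here, since late i values do more work."""
    total = N * (N - 1) // 2          # volume of the whole iteration domain
    bounds, done, i = [0], 0, 0
    for p in range(1, P + 1):
        target = round(p * total / P)  # cumulative volume slice p should reach
        while done < target and i < N:
            done += i                  # inner loop of iteration i runs i times
            i += 1
        bounds.append(i)
    return bounds                      # slice p covers i in [bounds[p], bounds[p+1])
```

Each slice deviates from the ideal volume total/P by at most one outer iteration's trip count, so assigning one slice per thread gives quasi-perfect load balance without any partial tiles or dynamic scheduling.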