Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes-per-lattice-update requirement of variable-coefficient
stencils. Our thread-group concept provides a controllable trade-off between
concurrency and memory usage, shifting pressure between the memory interface
and the CPU. We present performance results on a contemporary Intel processor.
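The memory-traffic argument can be made concrete with a minimal sketch. The code below is not the paper's wavefront diamond scheme; it is a generic overlapped temporal-blocking variant (with redundant halo computation) for a 1D three-point Jacobi stencil, written in Python with hypothetical tile parameters, to show how a tile can be loaded once and advanced several time steps before being written back:

```python
def step(u):
    # One Jacobi sweep with fixed boundary values.
    return [u[0]] + [(u[i-1] + u[i] + u[i+1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def jacobi_naive(u, T):
    # T full-grid sweeps: the whole grid streams through memory T times.
    for _ in range(T):
        u = step(u)
    return u

def jacobi_blocked(u, T, B):
    # Overlapped temporal blocking: each tile of width B is loaded once
    # with a halo of width T and advanced T steps before write-back.
    n = len(u)
    out = list(u)
    for s in range(0, n, B):
        lo, hi = max(0, s - T), min(n, s + B + T)
        tile = list(u[lo:hi])
        for _ in range(T):
            tile = step(tile)
        # Only points at distance >= T from the tile edge are valid.
        for i in range(s, min(s + B, n)):
            out[i] = tile[i - lo]
    return out
```

A naive code streams the full grid through memory once per sweep, so fusing T sweeps per tile load cuts main-memory traffic by roughly a factor of T at the cost of recomputing the halos; diamond tiling achieves a comparable reduction without the redundant work.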
Stencil codes on a vector length agnostic architecture
Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity and is usually limited by data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is Vector-Length Agnostic (VLA). VLA enables the generation of binaries that run regardless of the physical vector register length.
In this paper we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific
computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading, and data reuse. Our detailed simulations using vector lengths ranging from 128 to 2,048 bits show that these optimizations can lead to performance improvements over straightforwardly vectorized code of up to 56.6% for 2,048-bit vectors. In addition, we show that certain optimizations can hurt performance due to a reduction in arithmetic intensity, and we provide insight useful for compiler optimizers.
This work has been partially supported by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR-1414). The Mont-Blanc project receives funding from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreements no. 671697 and no. 779877. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number FJCI-2015-24753.
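As an aside on the arithmetic-intensity remark above: loop fusion trades memory traffic for extra computation. Below is a small Python sketch (illustrative only; the paper itself targets SVE code) of fusing two three-point sweeps, where the intermediate sweep is recomputed on the fly instead of being stored:

```python
def two_sweeps(u):
    # Baseline: materialize the intermediate array v, then sweep again.
    n = len(u)
    v = [u[0]] + [(u[i-1] + u[i] + u[i+1]) / 3.0
                  for i in range(1, n - 1)] + [u[-1]]
    return [v[0]] + [(v[i-1] + v[i] + v[i+1]) / 3.0
                     for i in range(1, n - 1)] + [v[-1]]

def fused_sweeps(u):
    # Fused: compute intermediate values on demand; no intermediate array.
    n = len(u)
    def v(i):
        if i == 0 or i == n - 1:
            return u[i]  # fixed boundaries
        return (u[i-1] + u[i] + u[i+1]) / 3.0
    return [v(0)] + [(v(i-1) + v(i) + v(i+1)) / 3.0
                     for i in range(1, n - 1)] + [v(n - 1)]
```

The fused version never materializes the intermediate array, reducing bytes moved, but recomputes each intermediate value up to three times; such flop-for-byte trades change the kernel's arithmetic intensity and, as the abstract notes, can help or hurt depending on the machine balance.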
Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor
Several emerging petascale architectures use energy-efficient processors with
vectorized computational units and in-order thread processing. On these
architectures the sustained performance of streaming numerical kernels,
ubiquitous in the solution of partial differential equations, represents a
challenge despite the regularity of memory access. Sophisticated optimization
techniques are required to fully utilize the Central Processing Unit (CPU).
We propose a new method for constructing streaming numerical kernels using a
high-level assembly synthesis and optimization framework. We describe an
implementation of this method in Python targeting the IBM Blue Gene/P
supercomputer's PowerPC 450 core. This paper details the high-level design,
construction, simulation, verification, and analysis of these kernels utilizing
a subset of the CPU's instruction set.
We demonstrate the effectiveness of our approach by implementing several
three-dimensional stencil kernels over a variety of cached memory scenarios and
analyzing the mechanically scheduled variants, including a 27-point stencil
achieving a 1.7x speedup over the best previously published results.
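The synthesis idea, generating a kernel programmatically and then tuning its schedule, can be illustrated with a toy snippet. This is not the paper's framework (which emits PowerPC 450 assembly); it merely generates an unrolled 1D stencil kernel as Python source, with the unroll factor as a parameter:

```python
def synthesize(unroll):
    # Emit Python source for a 3-point stencil kernel whose inner loop is
    # unrolled 'unroll' times, then compile it with exec().
    lines = ["def kernel(src, dst, n):",
             "    i = 1",
             f"    while i + {unroll} <= n - 1:"]
    for k in range(unroll):
        lines.append(f"        dst[i+{k}] = "
                     f"(src[i+{k}-1] + src[i+{k}] + src[i+{k}+1]) / 3.0")
    lines.append(f"        i += {unroll}")
    lines += ["    while i < n - 1:",         # remainder loop
              "        dst[i] = (src[i-1] + src[i] + src[i+1]) / 3.0",
              "        i += 1"]
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["kernel"]
```

In the paper's setting the generated instructions are additionally scheduled and simulated against the in-order core model; the toy above only shows the generate-then-execute structure.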
Mono-parametric tiling is a polyhedral transformation
Tiling is a crucial program transformation with many benefits: it improves locality, exposes parallelism, allows for adjusting the ops-to-bytes balance of codes, and can be applied at multiple levels. Allowing tile sizes to be symbolic parameters at compile time has many benefits, including efficient autotuning and run-time adaptability to system variations. For polyhedral programs, parametric tiling in its full generality is known to be non-linear, breaking the mathematical closure properties of the polyhedral model. Most compilation tools therefore either avoid it by only performing fixed-size tiling, or apply it only in the final code generation step. Both strategies have limitations. We first introduce mono-parametric partitioning, a restricted parametric, tiling-like transformation which can be used to express a tiling. We show that, despite being parametric, it is a polyhedral transformation. We first prove that applying mono-parametric partitioning (i) to a polyhedron yields a union of polyhedra, and (ii) to an affine function produces a piecewise-affine function. We then use these properties to show how to partition an entire polyhedral program, including one with reductions. Next, we generalize this transformation to tiles with arbitrary shapes that can tessellate the iteration space (e.g., hexagonal, trapezoidal, etc.). We show how mono-parametric tiling can be applied at multiple levels, and how it enables a wide range of polyhedral analyses and transformations to be applied.
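In one dimension, the mono-parametric change of basis is i = b*ti + j with 0 <= j < b, where b is the single size parameter; for fixed symbolic b the new constraints remain affine in (ti, j), which is why polyhedral closure is preserved. A small Python check (illustrative, not the authors' tooling) that the tiled space scans the original iteration space exactly once:

```python
def mono_parametric(N, b):
    # Enumerate {0 <= i < N} through the change of basis i = b*ti + j.
    # For a fixed symbolic b, the constraints 0 <= j < b and
    # 0 <= b*ti + j < N are affine in (ti, j).
    out = []
    for ti in range((N + b - 1) // b):   # tile index
        for j in range(b):               # intra-tile index
            i = b * ti + j
            if i < N:                    # partial last tile
                out.append(i)
    return out
```

Every iteration of the original space is visited exactly once, in order, for any choice of the single parameter b, which is the 1D analogue of the partitioning property proved in the paper.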
Analytical cost metrics: days of future past
2019 Summer. Includes bibliographical references.
Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators: special-purpose hardware that increases the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general-purpose workstations, tablets, phones, and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has expanded to also incorporate power/energy efficiency. This work builds analytical cost models for metrics such as time, energy, memory access, and silicon area. These models are used to predict the performance of applications, for performance tuning, and for chip design. The idea is to work with domain-specific accelerators, where analytical cost models can be used accurately for performance optimization, and to formulate the performance optimization problems as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in several ways. For stencil applications and GPU architectures, analytical cost models are developed for execution time as well as energy; these models are used for performance tuning on existing architectures and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed-form solutions for off-chip data movement are built and used to minimize the total data movement cost of a minimum-operation-count tree.
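As a toy illustration of the modeling style (the thesis builds far more detailed models), here is a roofline-like execution-time predictor with hypothetical hardware constants:

```python
def predicted_time(flops, bytes_moved, peak_flops=1e12, bandwidth=5e11):
    """Analytical execution-time estimate for one kernel invocation.

    Hypothetical machine: 1 Tflop/s peak compute, 500 GB/s memory
    bandwidth. Assumes perfect overlap of compute and memory transfer,
    so the slower of the two resources bounds the runtime.
    """
    t_compute = flops / peak_flops        # compute-bound lower limit
    t_memory = bytes_moved / bandwidth    # bandwidth-bound lower limit
    return max(t_compute, t_memory)
```

Formulating tuning as optimization then means choosing code parameters (tile sizes, unroll factors, thread counts) that minimize such a predicted cost subject to hardware constraints, which is the approach the thesis pursues with much richer models.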
Optimization and evaluation of the PARSEC benchmark using Intel Array Building Blocks
This Bachelor's thesis presents Intel Array Building Blocks (ArBB) as a new alternative to the existing parallel programming models for shared-memory architectures. To that end, the thirteen applications in the PARSEC benchmark suite were evaluated, analyzing which of them could offer good performance. Two of the benchmark programs, Blackscholes and Fluidanimate, were selected for applying this new technology; development was done in C++, starting from the sequential codes of the applications. Intel Array Building Blocks is a C++-based library that provides data parallelism by combining multiple cores and vector instructions on multicore architectures. ArBB is oriented toward optimizing matrix and vector operations. The PARSEC benchmark was conceived for both academic and scientific purposes and offers a set of applications for shared-memory architectures. These applications, drawn from very diverse domains, had previously been parallelized with well-known technologies such as Pthreads, OpenMP, and Intel TBB.
Blackscholes solves a well-known equation from finance through intensive mathematical computation; this application was optimized entirely with Intel Array Building Blocks. Fluidanimate simulates the dynamics of fluid motion; this animation application was parallelized with Intel ArBB together with Intel TBB, due to the specific characteristics of the algorithm and the restrictions imposed by Array Building Blocks. The performance of the two parallelized applications was evaluated on a computer with 8 hardware threads and uniform memory access (UMA) and on a computer with 48 hardware threads and non-uniform memory access (NUMA). Once parallelized, Blackscholes outperforms the other parallel programming models (Pthreads, OpenMP, and Intel TBB), achieving a maximum speedup of 13.40 on the 8-thread architecture and 22.11 on the 48-thread architecture. The ArBB+TBB version of Fluidanimate outperforms, in most of the tests performed, the TBB- and Pthreads-optimized versions included in the PARSEC benchmark, with a maximum speedup of 3.21 on the 8-thread architecture and 18.45 on the 48-thread architecture.
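For reference, the kernel at the heart of Blackscholes evaluates the closed-form Black-Scholes formula once per option. A minimal scalar Python version is sketched below (the benchmark itself is C++, and the thesis vectorizes it with ArBB; parameter names here are the standard ones):

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S, K, r, sigma, T):
    # European call: C = S*N(d1) - K*exp(-rT)*N(d2)
    d1 = (math.log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

def bs_put(S, K, r, sigma, T):
    # European put: P = K*exp(-rT)*N(-d2) - S*N(-d1)
    d1 = (math.log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return K * math.exp(-r * T) * norm_cdf(-d2) - S * norm_cdf(-d1)
```

Because the same independent formula is applied to a large array of options, the kernel is embarrassingly data-parallel, which is what makes it such a good fit for an array-oriented model like ArBB.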
High-order stencil computations on multicore clusters
Stencil computation (SC) is of critical importance for broad scientific and engineering applications. However, it is a challenge to optimize complex, high-order SC on emerging clusters of multicore processors. We have developed a hierarchical SC parallelization framework that combines: (1) spatial decomposition based on message passing; (2) multithreading using a critical-section-free, dual representation; and (3) single-instruction multiple-data (SIMD) parallelism based on various code transformations. Our SIMD transformations include translocated statement fusion, vector composition via shuffle, and vectorized data layout reordering (e.g. matrix transpose), which are combined with traditional optimization techniques such as loop unrolling. We have thereby implemented two SCs with different characteristics, a diagonally dominant lattice Boltzmann method (LBM) for fluid flow simulation and a highly off-diagonal, sixth-order finite-difference time-domain (FDTD) code for seismic wave propagation, on a Cell Broadband Engine (Cell BE) based system (a cluster of PlayStation 3 consoles), a dual Intel quad-core platform, and IBM BlueGene/L and P. We have achieved high inter-node and intra-node (multithreading and SIMD) scalability for the diagonally dominant LBM: weak-scaling parallel efficiency of 0.978 on 131,072 BlueGene/P processors; strong-scaling multithreading efficiency of 0.882 on 6 cores of Cell BE; and strong-scaling SIMD efficiency of 0.780 using the 4-element vector registers of Cell BE. Implementation of the high-order SC, in contrast, is less efficient due to long-stride memory access and the limited size of the vector register file, which points out the need for further optimizations.
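Of the SIMD transformations listed, data layout reordering is the easiest to sketch. The following Python fragment (illustrative only, with hypothetical field names) shows the array-of-structures to structure-of-arrays transpose that turns each field into a unit-stride stream suitable for SIMD loads:

```python
def aos_to_soa(records):
    """Transpose a list of per-site tuples into one list per field.

    Array-of-structures stores all fields of a site together, forcing
    strided access per field; structure-of-arrays makes each field
    contiguous, so a SIMD unit can load consecutive elements directly.
    """
    return tuple(list(col) for col in zip(*records))

# Hypothetical (density, energy) record per lattice site:
sites = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
density, energy = aos_to_soa(sites)
```

On vector hardware this reordering is performed in registers with shuffle/permute instructions; the Python transpose above only captures the data movement pattern, not its vectorized implementation.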