1,099 research outputs found

    Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

    Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling. Comment: 11 pages, 4 figures, 8 listings
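
The Roofline model named in this abstract bounds attainable performance by the smaller of peak compute throughput and memory bandwidth times arithmetic intensity. The sketch below illustrates only that bound; it does not use Kerncraft or its interface, and the function name `roofline_gflops` and the machine numbers are hypothetical values chosen for illustration.

```python
def roofline_gflops(peak_gflops: float, mem_bandwidth_gbs: float,
                    flops_per_byte: float) -> float:
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    if peak_gflops <= 0 or mem_bandwidth_gbs <= 0 or flops_per_byte <= 0:
        raise ValueError("all inputs must be positive")
    return min(peak_gflops, mem_bandwidth_gbs * flops_per_byte)


# Hypothetical machine (assumption): 50 GFLOP/s peak, 40 GB/s memory bandwidth.
# Triad-like kernel a[i] = b[i] + s * c[i] in double precision:
# 2 flops per iteration and 3 * 8 = 24 bytes of traffic
# (write-allocate traffic is ignored here for simplicity).
triad_intensity = 2 / 24
triad_bound = roofline_gflops(50.0, 40.0, triad_intensity)

# A kernel with high arithmetic intensity hits the compute ceiling instead.
compute_bound = roofline_gflops(50.0, 40.0, 10.0)
```

With these assumed numbers the triad-like kernel is limited by memory bandwidth, while the second kernel is limited by peak compute.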

    Performance and Optimization Abstractions for Large Scale Heterogeneous Systems in the Cactus/Chemora Framework

    We describe a set of lower-level abstractions to improve performance on modern large scale heterogeneous systems. These provide portable access to system- and hardware-dependent features, automatically apply dynamic optimizations at run time, and target stencil-based codes used in finite differencing, finite volume, or block-structured adaptive mesh refinement codes. These abstractions include a novel data structure to manage refinement information for block-structured adaptive mesh refinement, an iterator mechanism to efficiently traverse multi-dimensional arrays in stencil-based codes, and a portable API and implementation for explicit SIMD vectorization. These abstractions can either be employed manually, or be targeted by automated code generation, or be used via support libraries by compilers during code generation. The implementations described below are available in the Cactus framework, and are used e.g. in the Einstein Toolkit for relativistic astrophysics simulations

    Optimizing the Performance of Streaming Numerical Kernels on the IBM Blue Gene/P PowerPC 450 Processor

    Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution of partial differential equations, represents a challenge despite the regularity of memory access. Sophisticated optimization techniques are required to fully utilize the Central Processing Unit (CPU). We propose a new method for constructing streaming numerical kernels using a high-level assembly synthesis and optimization framework. We describe an implementation of this method in Python targeting the IBM Blue Gene/P supercomputer's PowerPC 450 core. This paper details the high-level design, construction, simulation, verification, and analysis of these kernels utilizing a subset of the CPU's instruction set. We demonstrate the effectiveness of our approach by implementing several three-dimensional stencil kernels over a variety of cached memory scenarios and analyzing the mechanically scheduled variants, including a 27-point stencil achieving a 1.7x speedup over the best previously published results
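
For readers unfamiliar with the term, a 27-point stencil updates each interior grid point from its full 3x3x3 neighborhood. The following is a naive reference sketch of that access pattern only, not the assembly synthesis framework described in the abstract; the function name `stencil27` and the uniform weight of 1/27 are assumptions made for illustration.

```python
def stencil27(grid, weight=1.0 / 27.0):
    """Apply a uniform-weight 27-point stencil to the interior of a 3D grid.

    `grid` is a nested list indexed as grid[k][j][i]; boundary points are
    copied unchanged. The uniform weight is an illustrative assumption.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # Copy the input so that boundary values carry over as-is.
    out = [[row[:] for row in plane] for plane in grid]
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                acc = 0.0
                # Sum the point itself and its 26 neighbors.
                for dk in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        for di in (-1, 0, 1):
                            acc += grid[k + dk][j + dj][i + di]
                out[k][j][i] = weight * acc
    return out


# Usage: on a constant field the averaged interior value is unchanged.
field = [[[2.0] * 3 for _ in range(3)] for _ in range(3)]
smoothed = stencil27(field)
```

Each interior update reads 27 values and writes one, which is why such kernels are usually limited by data movement and benefit from the careful scheduling the abstract describes.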

    Multi-GPU acceleration of large-scale density-based topology optimization

    This work presents a parallel implementation of density-based topology optimization using distributed GPU computing systems. The use of multiple GPU devices allows us to accelerate the computing process and to increase the device memory available for GPU computing. This increase in device memory enables us to address large models that commonly do not fit into one GPU device. Most modern scientific computers incorporate these devices to obtain energy-efficient, low-cost, and high-performance systems. However, we should adopt the proper techniques to take advantage of the computational resources of such high-performance many-core computing systems. It is well known that the bottleneck of density-based topology optimization is solving the linear elasticity problem using Finite Element Analysis (FEA) during the topology optimization iterations. We solve the linear system of equations obtained from FEA using a distributed conjugate gradient solver preconditioned by a smooth aggregation-based algebraic multigrid (AMG) using GPU computing with multiple devices. The use of aggregation-based AMG reduces memory requirements and improves the efficiency of the interpolation operation, which is advantageous for GPU computing. We evaluate the performance and scalability of the distributed GPU system using structured and unstructured meshes. We also test the performance using different 3D finite elements and relaxing operators. In addition, we evaluate the use of numerical approaches to increase the topology optimization performance. Finally, we present a comparison between the many-core computing instance and one efficient multi-core implementation to highlight the advantages of using GPU computing in large-scale density-based topology optimization problems. This work has been supported by the AEI/FEDER and UE under the contract DPI2016-77538-R, and by the “Fundación Séneca – Agencia de Ciencia y Tecnología de la Región de Murcia” of Spain under the contract 20911/PI/18
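
The solver named in this abstract is a preconditioned conjugate gradient method. As a minimal single-process sketch of that iteration, the code below swaps the AMG preconditioner for a much simpler Jacobi (diagonal) preconditioner and uses dense Python lists instead of distributed GPU data; the function name `pcg_jacobi` and the small test system are assumptions for illustration, not the authors' implementation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows."""
    return [dot(row, x) for row in A]


def pcg_jacobi(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Uses conjugate gradients with a Jacobi preconditioner (a stand-in
    for the AMG preconditioner mentioned in the abstract).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                                # residual b - A x for x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    z = [inv_diag[i] * r[i] for i in range(n)]  # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:   # relative residual check
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# Usage on a small SPD system whose exact solution is (1/11, 7/11).
solution = pcg_jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In the setting the abstract describes, the matrix-vector product and the preconditioner application would be the steps distributed across GPU devices; here they are plain loops to keep the structure of the iteration visible.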