242 research outputs found

    Modeling power consumption of 3D MPDATA and the CG method on ARM and Intel multicore architectures

    Get PDF
    We propose an approach to estimate the power consumption of algorithms, as a function of the frequency and number of cores, using only a very reduced set of real power measures. In addition, we also provide the formulation of a method to select the voltage–frequency scaling–concurrency throttling configurations that should be tested in order to obtain accurate estimations of the power dissipation. The power models and selection methodology are verified using two real scientific application: the stencil-based 3D MPDATA algorithm and the conjugate gradient (CG) method for sparse linear systems. MPDATA is a crucial component of the EULAG model, which is widely used in weather forecast simulations. The CG algorithm is the keystone for iterative solution of sparse symmetric positive definite linear systems via Krylov subspace methods. The reliability of the method is confirmed for a variety of ARM and Intel architectures, where the estimated results correspond to the real measured values with the average error being slightly below 5% in all cases

    Energy-Aware High Performance Computing

    Get PDF
    High performance computing centres consume substantial amounts of energy to power large-scale supercomputers and the necessary building and cooling infrastructure. Recently, considerable performance gains resulted predominantly from developments in multi-core, many-core and accelerator technology. Computing centres rapidly adopted this hardware to serve the increasing demand for computational power. However, further performance increases in large-scale computing systems are limited by the aggregate energy budget required to operate them. Power consumption has become a major cost factor for computing centres. Furthermore, energy consumption results in carbon dioxide emissions, a hazard for the environment and public health; and heat, which reduces the reliability and lifetime of hardware components. Energy efficiency is therefore crucial in high performance computing

    Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)

    Get PDF
    Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum performance or is written in a high-level language to achieve portability at the expense of performance. We propose a novel approach that offers high-level programming, code portability and high-performance. It is based on algorithmic pattern composition coupled with a powerful, yet simple, set of rewrite rules. This enables systematic transformation and optimization of a high-level program into a low-level hardware specific representation which leads to high performance code. We test our design in practice by describing a subset of the OpenCL programming model with low-level patterns and by implementing a compiler which generates high performance OpenCL code. Our experiments show that we can systematically derive high-performance device-specific implementations from simple high-level algorithmic expressions. The performance of the generated OpenCL code is on par with highly tuned implementations for multicore CPUs and GPUs written by expertsComment: Technical Repor

    Hardware compilation of deep neural networks: an overview

    Get PDF
    Deploying a deep neural network model on a reconfigurable platform, such as an FPGA, is challenging due to the enormous design spaces of both network models and hardware design. A neural network model has various layer types, connection patterns and data representations, and the corresponding implementation can be customised with different architectural and modular parameters. Rather than manually exploring this design space, it is more effective to automate optimisation throughout an end-to-end compilation process. This paper provides an overview of recent literature proposing novel approaches to achieve this aim. We organise materials to mirror a typical compilation flow: front end, platform-independent optimisation and back end. Design templates for neural network accelerators are studied with a specific focus on their derivation methodologies. We also review previous work on network compilation and optimisation for other hardware platforms to gain inspiration regarding FPGA implementation. Finally, we propose some future directions for related research

    Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors

    Full text link
    Factorization and multiplication of dense matrices and tensors are critical, yet extremely expensive pieces of the scientific toolbox. Careful use of low rank approximation can drastically reduce the computation and memory requirements of these operations. In addition to a lower arithmetic complexity, such methods can, by their structure, be designed to efficiently exploit modern hardware architectures. The majority of existing work relies on batched BLAS libraries to handle the computation of many small dense matrices. We show that through careful analysis of the cache utilization, register accumulation using SIMD registers and a redesign of the implementation, one can achieve significantly higher throughput for these types of batched low-rank matrices across a large range of block and batch sizes. We test our algorithm on 3 CPUs using diverse ISAs -- the Fujitsu A64FX using ARM SVE, the Intel Xeon 6148 using AVX-512 and AMD EPYC 7502 using AVX-2, and show that our new batching methodology is able to obtain more than twice the throughput of vendor optimized libraries for all CPU architectures and problem sizes

    Adaptive heterogeneous parallelism for semi-empirical lattice dynamics in computational materials science.

    Get PDF
    With the variability in performance of the multitude of parallel environments available today, the conceptual overhead created by the need to anticipate runtime information to make design-time decisions has become overwhelming. Performance-critical applications and libraries carry implicit assumptions based on incidental metrics that are not portable to emerging computational platforms or even alternative contemporary architectures. Furthermore, the significance of runtime concerns such as makespan, energy efficiency and fault tolerance depends on the situational context. This thesis presents a case study in the application of both Mattsons prescriptive pattern-oriented approach and the more principled structured parallelism formalism to the computational simulation of inelastic neutron scattering spectra on hybrid CPU/GPU platforms. The original ad hoc implementation as well as new patternbased and structured implementations are evaluated for relative performance and scalability. Two new structural abstractions are introduced to facilitate adaptation by lazy optimisation and runtime feedback. A deferred-choice abstraction represents a unified space of alternative structural program variants, allowing static adaptation through model-specific exhaustive calibration with regards to the extrafunctional concerns of runtime, average instantaneous power and total energy usage. Instrumented queues serve as mechanism for structural composition and provide a representation of extrafunctional state that allows realisation of a market-based decentralised coordination heuristic for competitive resource allocation and the Lyapunov drift algorithm for cooperative scheduling

    Model-Based Design for High-Performance Signal Processing Applications

    Get PDF
    Developing high-performance signal processing applications requires not only effective signal processing algorithms but also efficient software design methods that can take full advantage of the available processing resources. An increasingly important type of hardware platform for high-performance signal processing is a multicore central processing unit (CPU) combined with a graphics processing unit (GPU) accelerator. Efficiently coordinating computations on both the host (CPU) and device (GPU), and managing host-device data transfers are critical to utilizing CPU-GPU platforms effectively. However, such coordination is challenging for system designers, given the complexity of modern signal processing applications and the stringent constraints under which they must operate. Dataflow models of computation provide a useful framework for addressing this challenge. In such a modeling approach, signal processing applications are represented as directed graphs that can be viewed intuitively as high-level signal flow diagrams. The formal, high-level abstraction provided by dataflow principles provides a useful foundation to investigate model-based analysis and optimization for new challenges in design and implementation of signal processing systems. This thesis presents a new model-based design methodology and an evolution of three novel design tools. These contributions provide an automated design flow for high performance signal processing. The design flow takes high-level dataflow representations as input and systematically derives optimized implementations on CPU-GPU platforms. The proposed design flow and associated design methodology are inspired by a previously-developed application programming interface (API) called the Hybrid Task Graph Scheduler (HTGS). HTGS was developed for implementing scalable workflows for high-performance computing applications on compute nodes that have large numbers of processing cores, and that may be equipped with multiple GPUs. However, HTGS has a limitation due to its relatively loose use of dataflow techniques (or other forms of model-based design), which results in a significant designer effort being required to apply the provided APIs effectively. The main contributions of the thesis are summarized as follows: (1) Development of a companion tool to HTGS that is called the HTGS Model-based Engine (HMBE). HMBE introduces novel capabilities to automatically analyze application dataflow graphs and generate efficient schedules for these graphs through hybrid compile-time and runtime analysis. The systematic, model-based approaches provided by HMBE enable the automation of complex tasks that must be performed manually when using HTGS alone. We have demonstrated the effectiveness of HMBE and the associated model-based design methodology through extensive experiments involving two case studies: an image stitching application for large scale microscopy images, and a background subtraction application for multispectral video streams. (2) Integration of HMBE with HTGS to develop a new design tool for the design and implementation of high-performance signal processing systems. This tool, called HMBE-Integrated-HTGS (HI-HTGS), provides novel capabilities for model-based system design, memory management, and scheduling targeted to multicore platforms. HMBE takes as input a single- or multi-dimensional dataflow model of the given signal processing application. The tool then expands the dataflow model into an expanded representation that exposes more parallelism and provides significantly more detail on the interactions between different application tasks (dataflow actors). This expanded representation is derived by HI-HTGS at compile-time and provided as input to the HI-HTGS runtime system. The runtime system in turn applies the expanded representation to guide dynamic scheduling decisions throughout system execution. (3) Extension of HMBE to the class of CPU-GPU platforms motivated above. We call this new model-based design tool the CPU-GPU Model-Based Engine (CGMBE). CGMBE uses an unfolded dataflow graph representation of the application along with thread-pool-based executors, which are optimized for efficient operation on the targeted CPU-GPU platform. This approach automates complex aspects of the design and implementation process for signal processing system designers while maximizing the utilization of computational power, reducing the memory footprint for both the CPU and GPU, and facilitating experimentation for tuning performance-oriented designs
    • …
    corecore