2,823 research outputs found

Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels

Author: C. Evans
D. Unat
J. Hammer
J. Hofmann
M. Wittmann
S. Williams
T. Grosser
Y. Lo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/01/2017

Achieving optimal program performance requires deep insight into the interaction between hardware and software. For software developers without an in-depth background in computer architecture, understanding and fully utilizing modern architectures is close to impossible. Analytic loop performance modeling is a useful way to understand the relevant bottlenecks of code execution based on simple machine models. The Roofline model and the Execution-Cache-Memory (ECM) model are proven approaches to performance modeling of loop nests. In comparison to the Roofline model, the ECM model can also describe the single-core performance and saturation behavior on a multicore chip. We give an introduction to the Roofline and ECM models, and to stencil performance modeling using layer conditions (LC). We then present Kerncraft, a tool that can automatically construct Roofline and ECM models for loop nests by performing the required code, data transfer, and LC analysis. The layer condition analysis makes it possible to predict optimal spatial blocking factors for loop nests. Together with the models, it enables an ab-initio estimate of the potential benefits of loop blocking optimizations and of useful block sizes. In cases where LC analysis is not easily possible, Kerncraft supports a cache simulator as a fallback option. Using a 25-point long-range stencil, we demonstrate the usefulness and predictive power of the Kerncraft tool. Comment: 22 pages, 5 figures
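As a rough illustration of the Roofline model mentioned in this abstract, the sketch below bounds attainable performance by the minimum of peak compute throughput and memory bandwidth times computational intensity. It is not taken from Kerncraft; the function name, the machine numbers, and the STREAM-triad traffic count (write-allocate ignored) are all assumptions chosen for the example.

```python
def roofline_performance(peak_flops, mem_bandwidth, intensity):
    """Attainable performance (FLOP/s) under the basic Roofline model.

    peak_flops    -- peak compute throughput in FLOP/s
    mem_bandwidth -- sustained memory bandwidth in byte/s
    intensity     -- computational intensity of the loop in FLOP/byte
    """
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical machine parameters (assumed values, not measurements).
PEAK_FLOPS = 40e9      # 40 GFLOP/s
MEM_BANDWIDTH = 20e9   # 20 GB/s

# STREAM triad a[i] = b[i] + s * c[i] in double precision:
# 2 FLOPs per iteration, 2 loads + 1 store = 24 bytes (write-allocate ignored).
triad_intensity = 2 / 24

triad_perf = roofline_performance(PEAK_FLOPS, MEM_BANDWIDTH, triad_intensity)
print(f"Triad bound: {triad_perf / 1e9:.2f} GFLOP/s")
```

With these assumed numbers the triad is limited by memory bandwidth rather than by the compute peak, which is the kind of bottleneck statement the Roofline model is meant to provide.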

arXiv.org e-Print Archive

Crossref

Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

Author: Eitzinger Jan
Hager Georg
Hammer Julian
Wellein Gerhard
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2015

Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model, one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling. Comment: 11 pages, 4 figures, 8 listings

arXiv.org e-Print Archive

Crossref

Performance and resource modeling for FPGAs using high-level synthesis tools

Author: Braeken An
D'Hollander Erik
da Silva Gomes Bruno
Touhafi Abdellah
Publication venue: 'IOS Press'
Publication date: 01/01/2014

High-performance computing with FPGAs is gaining momentum with the advent of sophisticated High-Level Synthesis (HLS) tools. The performance of a design is affected by the input-output bandwidth, the code optimizations, and the resource consumption, which makes performance estimation a challenge. This paper proposes a performance model which extends the roofline model to take into account the resource consumption and the parameters used in the HLS tools. A strategy is developed which maximizes the performance and the resource utilization within the area of the FPGA. The model is used to optimize the design exploration of a class of window-based image processing applications.

Ghent University Academic Bibliography

Study of combining GPU/FPGA accelerators for high-performance computing

Author: Braeken An
Cornelis Jan G
D'Hollander Erik
da Silva Gomes Bruno
Lemeire Jan
Touhafi Abdellah
Publication venue: HiPEAC
Publication date: 01/01/2013

This contribution presents the performance modeling of a super desktop with GPU and FPGA accelerators, using OpenCL for the GPU and a high-level synthesis compiler for the FPGAs. The performance model is used to evaluate the different high-level synthesis optimizations, taking into account the resource usage, and to compare the compute power of the FPGA with the GPU.

Ghent University Academic Bibliography

Approximate FPGA-based LSTMs under Computation Time Constraints

Author: Bouganis Christos-Savvas
Kouris Alexandros
Rizakis Michalis
Venieris Stylianos I.
Publication date: 05/01/2018

Recurrent Neural Networks, and in particular Long Short-Term Memory (LSTM) networks, have demonstrated state-of-the-art accuracy in several emerging Artificial Intelligence tasks. However, the models are becoming increasingly demanding in terms of computational and memory load. Emerging latency-sensitive applications, including mobile robots and autonomous vehicles, often operate under stringent computation time constraints. In this paper, we address the challenge of deploying computationally demanding LSTMs at a constrained time budget by introducing an approximate computing scheme that combines iterative low-rank compression and pruning, along with a novel FPGA-based LSTM architecture. Combined in an end-to-end framework, the approximation method's parameters are optimised and the architecture is configured to address the problem of high-performance LSTM execution in time-constrained applications. Quantitative evaluation on a real-life image captioning application indicates that the proposed methods required up to 6.5x less time to achieve the same application-level accuracy compared to a baseline method, while achieving an average of 25x higher accuracy under the same computation time constraints. Comment: Accepted at the 14th International Symposium in Applied Reconfigurable Computing (ARC) 201

arXiv.org e-Print Archive

Crossref

Spiral - Imperial College Digital Repository