345 research outputs found

    Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft

    Full text link
    Analytic performance models are essential for understanding the performance characteristics of loop kernels, which consume a major part of CPU cycles in computational science. Starting from a validated performance model one can infer the relevant hardware bottlenecks and promising optimization opportunities. Unfortunately, analytic performance modeling is often tedious even for experienced developers since it requires in-depth knowledge about the hardware and how it interacts with the software. We present the "Kerncraft" tool, which eases the construction of analytic performance models for streaming kernels and stencil loop nests. Starting from the loop source code, the problem size, and a description of the underlying hardware, Kerncraft can ideally predict the single-core performance and scaling behavior of loops on multicore processors using the Roofline or the Execution-Cache-Memory (ECM) model. We describe the operating principles of Kerncraft with its capabilities and limitations, and we show how it may be used to quickly gain insights by accelerated analytic modeling.Comment: 11 pages, 4 figures, 8 listing

    Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications

    Get PDF
    Energy efficiency is becoming increasingly important for computing systems, in particular for large scale HPC facilities. In this work we evaluate, from an user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable running the full code on a HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need of any additional hardware, and we enable it to change CPUs and GPUs clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy saving strategies which can be easily adopted on recent high-end HPC systems for generic applications

    Mira: A Framework for Static Performance Analysis

    Full text link
    The performance model of an application can pro- vide understanding about its runtime behavior on particular hardware. Such information can be analyzed by developers for performance tuning. However, model building and analyzing is frequently ignored during software development until perfor- mance problems arise because they require significant expertise and can involve many time-consuming application runs. In this paper, we propose a fast, accurate, flexible and user-friendly tool, Mira, for generating performance models by applying static program analysis, targeting scientific applications running on supercomputers. We parse both the source code and binary to estimate performance attributes with better accuracy than considering just source or just binary code. Because our analysis is static, the target program does not need to be executed on the target architecture, which enables users to perform analysis on available machines instead of conducting expensive exper- iments on potentially expensive resources. Moreover, statically generated models enable performance prediction on non-existent or unavailable architectures. In addition to flexibility, because model generation time is significantly reduced compared to dynamic analysis approaches, our method is suitable for rapid application performance analysis and improvement. We present several scientific application validation results to demonstrate the current capabilities of our approach on small benchmarks and a mini application

    Solving the Klein-Gordon equation using Fourier spectral methods: A benchmark test for computer performance

    Get PDF
    The cubic Klein-Gordon equation is a simple but non-trivial partial differential equation whose numerical solution has the main building blocks required for the solution of many other partial differential equations. In this study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve the Klein-Gordon equation and strong scaling of the code is examined on thirteen different machines for a problem size of 512^3. The results are useful in assessing likely performance of other parallel fast Fourier transform based programs for solving partial differential equations. The problem is chosen to be large enough to solve on a workstation, yet also of interest to solve quickly on a supercomputer, in particular for parametric studies. Unlike other high performance computing benchmarks, for this problem size, the time to solution will not be improved by simply building a bigger supercomputer.Comment: 10 page
    • …
    corecore