37,211 research outputs found
Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels
Useful models of loop kernel runtimes on out-of-order architectures require
an analysis of the in-core performance behavior of instructions and their
dependencies. While an instruction throughput prediction sets a lower bound to
the kernel runtime, the critical path defines an upper bound. Such predictions
are an essential part of analytic (i.e., white-box) performance models like the
Roofline and Execution-Cache-Memory (ECM) models. They enable a better
understanding of the performance-relevant interactions between hardware
architecture and loop code. The Open Source Architecture Code Analyzer (OSACA)
is a static analysis tool for predicting the execution time of sequential
loops. It previously supported only x86 (Intel and AMD) architectures and
simple, optimistic full-throughput execution. We have heavily extended OSACA to
support ARM instructions and critical path prediction including the detection
of loop-carried dependencies, which turns it into a versatile
cross-architecture modeling tool. We show runtime predictions for code on Intel
Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on
machine models from available documentation and semi-automatic benchmarking.
The predictions are compared with actual measurements.Comment: 6 pages, 3 figure
Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures
An accurate prediction of scheduling and execution of instruction streams is
a necessary prerequisite for predicting the in-core performance behavior of
throughput-bound loop kernels on out-of-order processor architectures. Such
predictions are an indispensable component of analytical performance models,
such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a
deep understanding of the performance-relevant interactions between hardware
architecture and loop code. We present the Open Source Architecture Code
Analyzer (OSACA), a static analysis tool for predicting the execution time of
sequential loops comprising x86 instructions under the assumption of an
infinite first-level cache and perfect out-of-order scheduling. We show the
process of building a machine model from available documentation and
semi-automatic benchmarking, and carry it out for the latest Intel Skylake and
AMD Zen micro-architectures. To validate the constructed models, we apply them
to several assembly kernels and compare runtime predictions with actual
measurements. Finally we give an outlook on how the method may be generalized
to new architectures.Comment: 11 pages, 4 figures, 7 table
Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels
Achieving optimal program performance requires deep insight into the
interaction between hardware and software. For software developers without an
in-depth background in computer architecture, understanding and fully utilizing
modern architectures is close to impossible. Analytic loop performance modeling
is a useful way to understand the relevant bottlenecks of code execution based
on simple machine models. The Roofline Model and the Execution-Cache-Memory
(ECM) model are proven approaches to performance modeling of loop nests. In
comparison to the Roofline model, the ECM model can also describes the
single-core performance and saturation behavior on a multicore chip. We give an
introduction to the Roofline and ECM models, and to stencil performance
modeling using layer conditions (LC). We then present Kerncraft, a tool that
can automatically construct Roofline and ECM models for loop nests by
performing the required code, data transfer, and LC analysis. The layer
condition analysis allows to predict optimal spatial blocking factors for loop
nests. Together with the models it enables an ab-initio estimate of the
potential benefits of loop blocking optimizations and of useful block sizes. In
cases where LC analysis is not easily possible, Kerncraft supports a cache
simulator as a fallback option. Using a 25-point long-range stencil we
demonstrate the usefulness and predictive power of the Kerncraft tool.Comment: 22 pages, 5 figure
Automatic Loop Kernel Analysis and Performance Modeling With Kerncraft
Analytic performance models are essential for understanding the performance
characteristics of loop kernels, which consume a major part of CPU cycles in
computational science. Starting from a validated performance model one can
infer the relevant hardware bottlenecks and promising optimization
opportunities. Unfortunately, analytic performance modeling is often tedious
even for experienced developers since it requires in-depth knowledge about the
hardware and how it interacts with the software. We present the "Kerncraft"
tool, which eases the construction of analytic performance models for streaming
kernels and stencil loop nests. Starting from the loop source code, the problem
size, and a description of the underlying hardware, Kerncraft can ideally
predict the single-core performance and scaling behavior of loops on multicore
processors using the Roofline or the Execution-Cache-Memory (ECM) model. We
describe the operating principles of Kerncraft with its capabilities and
limitations, and we show how it may be used to quickly gain insights by
accelerated analytic modeling.Comment: 11 pages, 4 figures, 8 listing
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM--based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.Comment: Published at 36th International Conference on Machine Learning (ICML)
201
- …