Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures
An accurate prediction of scheduling and execution of instruction streams is
a necessary prerequisite for predicting the in-core performance behavior of
throughput-bound loop kernels on out-of-order processor architectures. Such
predictions are an indispensable component of analytical performance models,
such as the Roofline and the Execution-Cache-Memory (ECM) model, and allow a
deep understanding of the performance-relevant interactions between hardware
architecture and loop code. We present the Open Source Architecture Code
Analyzer (OSACA), a static analysis tool for predicting the execution time of
sequential loops comprising x86 instructions under the assumption of an
infinite first-level cache and perfect out-of-order scheduling. We show the
process of building a machine model from available documentation and
semi-automatic benchmarking, and carry it out for the latest Intel Skylake and
AMD Zen micro-architectures. To validate the constructed models, we apply them
to several assembly kernels and compare runtime predictions with actual
measurements. Finally, we give an outlook on how the method may be generalized
to new architectures.
Comment: 11 pages, 4 figures, 7 tables
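To make this kind of prediction concrete, here is a minimal sketch of a port-pressure throughput bound in Python (the machine model, instruction forms, and kernel are invented, and uops are split evenly over admissible ports rather than scheduled optimally as a real tool such as OSACA would do): the busiest port sets the cycles-per-iteration bound.

    from collections import defaultdict

    # Hypothetical per-instruction port usage for a made-up Skylake-like machine model:
    # each entry maps an instruction form to (uops, list of ports that can execute it).
    PORT_MODEL = {
        "vfmadd231pd ymm, ymm, mem": (1, ["0", "1"]),   # FMA on port 0 or 1
        "vmovapd ymm, mem":          (1, ["2", "3"]),   # load on port 2 or 3
        "vmovapd mem, ymm":          (1, ["4"]),        # store data on port 4
        "add r64, imm":              (1, ["0", "1", "5", "6"]),
        "cmp r64, r64":              (1, ["0", "1", "5", "6"]),
        "jb label":                  (1, ["6"]),
    }

    def throughput_bound(kernel):
        """Distribute uops evenly over admissible ports and return the
        cycles-per-iteration bound implied by the busiest port."""
        pressure = defaultdict(float)
        for instr in kernel:
            uops, ports = PORT_MODEL[instr]
            for p in ports:
                pressure[p] += uops / len(ports)   # optimistic even split
        return max(pressure.values()), dict(pressure)

    # A made-up streaming loop body (one iteration).
    kernel = [
        "vmovapd ymm, mem",
        "vmovapd ymm, mem",
        "vfmadd231pd ymm, ymm, mem",
        "vmovapd mem, ymm",
        "add r64, imm",
        "cmp r64, r64",
        "jb label",
    ]

    cycles, pressure = throughput_bound(kernel)
    print(f"predicted {cycles:.2f} cy/iter, port pressure: {pressure}")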
Automatic Throughput and Critical Path Analysis of x86 and ARM Assembly Kernels
Useful models of loop kernel runtimes on out-of-order architectures require
an analysis of the in-core performance behavior of instructions and their
dependencies. While an instruction throughput prediction sets a lower bound to
the kernel runtime, the critical path defines an upper bound. Such predictions
are an essential part of analytic (i.e., white-box) performance models like the
Roofline and Execution-Cache-Memory (ECM) models. They enable a better
understanding of the performance-relevant interactions between hardware
architecture and loop code. The Open Source Architecture Code Analyzer (OSACA)
is a static analysis tool for predicting the execution time of sequential
loops. It previously supported only x86 (Intel and AMD) architectures and
simple, optimistic full-throughput execution. We have heavily extended OSACA to
support ARM instructions and critical path prediction including the detection
of loop-carried dependencies, which turns it into a versatile
cross-architecture modeling tool. We show runtime predictions for code on Intel
Cascade Lake, AMD Zen, and Marvell ThunderX2 micro-architectures based on
machine models from available documentation and semi-automatic benchmarking.
The predictions are compared with actual measurements.
Comment: 6 pages, 3 figures
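As a toy illustration of the critical-path side of the analysis (invented latencies and a deliberately simplified notion of loop-carried dependence; this is not OSACA's algorithm), the sketch below walks a loop body in program order, propagates the completion times of register producers, and flags instructions that read and write the same register within one iteration.

    # Toy latency-based critical-path estimate for one loop iteration.
    # Each instruction: (name, destination register, source registers, latency in cycles).
    loop_body = [
        ("load  x -> r1",    "r1", ["x"],        4),
        ("mul   r1*r2 -> r3","r3", ["r1", "r2"], 4),
        ("add   r3+r4 -> r4","r4", ["r3", "r4"], 1),   # r4 feeds itself: loop-carried
        ("store r4 -> y",    None, ["r4"],       1),
    ]

    def critical_path(body):
        finish = {}          # register -> completion time of its producer
        longest = 0.0
        for name, dst, srcs, lat in body:
            start = max((finish.get(s, 0.0) for s in srcs), default=0.0)
            done = start + lat
            if dst is not None:
                finish[dst] = done
            longest = max(longest, done)
        return longest

    def loop_carried(body):
        # Simplified: an instruction that reads and writes the same register
        # chains into the next iteration of the loop.
        return [name for name, dst, srcs, _ in body if dst is not None and dst in srcs]

    print("critical path:", critical_path(loop_body), "cycles")
    print("loop-carried via:", loop_carried(loop_body))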
Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated for modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures: the
process is tedious, error prone, and must be repeated from scratch for each
processor generation. In this paper we present Ithemal, the first tool that
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM-based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.Comment: Published at 36th International Conference on Machine Learning (ICML)
201
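The hierarchical structure can be pictured with a short model sketch, assuming a PyTorch-style reconstruction with invented vocabulary size and hidden width (this is not the released Ithemal code): one LSTM folds the opcode and operand tokens of each instruction into an instruction embedding, a second LSTM runs over the sequence of instruction embeddings, and a linear head emits the predicted cycle count.

    import torch
    import torch.nn as nn

    class HierarchicalThroughputModel(nn.Module):
        """Token-level LSTM per instruction, instruction-level LSTM per basic
        block, and a linear head that outputs one throughput estimate (cycles)."""

        def __init__(self, vocab_size=1024, embed_dim=64, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.token_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.instr_lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, 1)

        def forward(self, block):
            # block: list of 1-D LongTensors, one per instruction
            # (token IDs for the opcode followed by its operands).
            instr_vecs = []
            for tokens in block:
                emb = self.embed(tokens).unsqueeze(0)   # (1, n_tokens, embed_dim)
                _, (h, _) = self.token_lstm(emb)        # h: (1, 1, hidden_dim)
                instr_vecs.append(h[-1])                # (1, hidden_dim)
            seq = torch.stack(instr_vecs, dim=1)        # (1, n_instr, hidden_dim)
            _, (h, _) = self.instr_lstm(seq)
            return self.head(h[-1]).squeeze(-1)         # predicted cycles

    # Example: a made-up two-instruction block encoded as arbitrary token IDs.
    model = HierarchicalThroughputModel()
    block = [torch.tensor([3, 17, 42]), torch.tensor([5, 42, 42])]
    print(model(block))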
Analytic Performance Modeling and Analysis of Detailed Neuron Simulations
Big science initiatives are trying to reconstruct and model the brain by
attempting to simulate brain tissue at larger scales and with increasingly more
biological detail than previously thought possible. The exponential growth of
parallel computer performance has been supporting these developments, and at
the same time maintainers of neuroscientific simulation code have strived to
optimally and efficiently exploit new hardware features. Current
state-of-the-art software for the simulation of biological networks has so far been
developed using performance engineering practices, but a thorough analysis and
modeling of the computational and performance characteristics, especially in
the case of morphologically detailed neuron simulations, is lacking. Other
computational sciences have successfully used analytic performance engineering
and modeling methods to gain insight on the computational properties of
simulation kernels, aid developers in performance optimizations and eventually
drive co-design efforts, but to our knowledge a model-based performance
analysis of neuron simulations has not yet been conducted.
We present a detailed study of the shared-memory performance of
morphologically detailed neuron simulations based on the Execution-Cache-Memory
(ECM) performance model. We demonstrate that this model can deliver accurate
predictions of the runtime of almost all the kernels that constitute the neuron
models under investigation. The gained insight is used to identify the main
governing mechanisms underlying performance bottlenecks in the simulation. The
implications of this analysis on the optimization of neural simulation software
and eventually co-design of future hardware architectures are discussed. In
this sense, our work represents a valuable conceptual and quantitative
contribution to understanding the performance properties of biological network
simulations.
Comment: 18 pages, 6 figures, 15 tables
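For readers unfamiliar with the ECM model, the toy calculation below states its single-core composition rule in its simplest form, assuming data transfers overlap neither with each other nor with the non-overlapping core cycles (all cycle counts are invented and serve only as an illustration of how the prediction is composed).

    def ecm_single_core_cycles(t_ol, t_nol, transfer_cycles):
        """Single-core ECM prediction (cycles per unit of work, e.g. one cache
        line of iterations), assuming no overlap among data transfers and the
        non-overlapping core time:
            T_ECM = max(T_OL, T_nOL + T_L1L2 + T_L2L3 + T_L3Mem)
        """
        return max(t_ol, t_nol + sum(transfer_cycles))

    # Invented numbers for a streaming kernel, in cycles per cache line of iterations:
    t_ol = 8                 # in-core work that can overlap with transfers
    t_nol = 4                # non-overlapping in-core work (e.g. load/store issue)
    transfers = [4, 6, 10]   # L1<->L2, L2<->L3, L3<->memory

    print("data-in-L1 prediction:    ", ecm_single_core_cycles(t_ol, t_nol, []), "cy")
    print("data-in-memory prediction:", ecm_single_core_cycles(t_ol, t_nol, transfers), "cy")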
From micro-OPs to abstract resources: constructing a simpler CPU performance model through microbenchmarking
This paper describes PALMED, a tool that automatically builds a resource
mapping, a performance model for pipelined, super-scalar, out-of-order CPU
architectures. Resource mappings describe the execution of a program by
assigning instructions in the program to abstract resources. They can be used
to predict the throughput of basic blocks or as a machine model for the backend
of an optimizing compiler. PALMED does not require hardware performance
counters, and relies solely on runtime measurements to construct resource
mappings. This allows it to model not only execution port usage, but also other
limiting resources, such as the frontend or the reorder buffer. Also, thanks to
a dual representation of resource mappings, our algorithm for constructing
mappings scales to large instruction sets, like that of x86. We evaluate the
algorithmic contribution of the paper in two ways. First, we show that our
approach can reverse-engineer an accurate resource mapping from an
idealized performance model produced by an existing port mapping. Second, we
evaluate the pertinence of our dual representation, as opposed to the standard
port mapping, for throughput modeling by extracting a representative set of
basic blocks from the compiled binaries of the SPEC CPU 2017 benchmarks and
comparing the throughput predicted by existing machine models to that produced
by PALMED.
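The role of a resource mapping can be sketched as follows (a toy model with invented abstract resources and occupancies, not output produced by PALMED): every instruction occupies fractions of a set of abstract resources, and the reciprocal throughput of a basic block is bounded by its most heavily occupied resource.

    from collections import defaultdict

    # Hypothetical mapping from instructions to abstract-resource occupancy
    # (in cycles per execution).
    RESOURCE_MAP = {
        "fma":   {"fp_pipe": 0.5, "frontend": 0.25},
        "load":  {"mem_pipe": 0.5, "frontend": 0.25},
        "store": {"mem_pipe": 1.0, "store_buf": 1.0, "frontend": 0.25},
    }

    def block_throughput(block):
        """Reciprocal throughput (cycles per block repetition) is the maximum,
        over all abstract resources, of the block's summed occupancy."""
        usage = defaultdict(float)
        for instr in block:
            for res, occ in RESOURCE_MAP[instr].items():
                usage[res] += occ
        bottleneck = max(usage, key=usage.get)
        return usage[bottleneck], bottleneck

    cycles, limiter = block_throughput(["load", "load", "fma", "store"])
    print(f"{cycles:.2f} cycles per iteration, limited by {limiter}")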
SMoTherSpectre: exploiting speculative execution through port contention
Spectre, Meltdown, and related attacks have demonstrated that kernels,
hypervisors, trusted execution environments, and browsers are prone to
information disclosure through micro-architectural weaknesses. However, it
remains unclear to what extent other applications, in particular those that
do not load attacker-provided code, may be impacted. It also remains unclear
to what extent these attacks rely on cache-based side channels.
We introduce SMoTherSpectre, a speculative code-reuse attack that leverages
port-contention in simultaneously multi-threaded processors (SMoTher) as a side
channel to leak information from a victim process. SMoTher is a fine-grained
side channel that detects contention based on a single victim instruction. To
discover real-world gadgets, we describe a methodology and build a tool that
locates SMoTher-gadgets in popular libraries. In an evaluation on glibc, we
found hundreds of gadgets that can be used to leak information. Finally, we
demonstrate proof-of-concept attacks against the OpenSSH server, creating
oracles for determining four host key bits, and against an application
performing encryption using the OpenSSL library, creating an oracle which can
differentiate a bit of the plaintext through gadgets in libcrypto and glibc.
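The decision step of such a side channel can be pictured with a purely synthetic simulation (all numbers are invented; the real attack requires native code pinned to SMT sibling threads, which this Python sketch does not attempt): the attacker calibrates timing distributions for a port-heavy probe sequence with and without contention, then classifies fresh observations against a threshold to recover one bit.

    import random
    import statistics

    random.seed(0)

    def timed_probe(victim_uses_port):
        """Stand-in for timing a port-heavy instruction sequence on one SMT
        thread while the victim runs on the sibling; contention inflates the
        observed latency."""
        base = 100.0
        contention = 12.0 if victim_uses_port else 0.0
        return base + contention + random.gauss(0, 3.0)

    # Calibration phase: measure both cases with a known victim to place a threshold.
    idle = [timed_probe(False) for _ in range(200)]
    busy = [timed_probe(True) for _ in range(200)]
    threshold = (statistics.mean(idle) + statistics.mean(busy)) / 2

    # Attack phase: infer the secret-dependent bit from fresh observations.
    secret_bit = 1
    samples = [timed_probe(secret_bit == 1) for _ in range(50)]
    guess = int(statistics.mean(samples) > threshold)
    print("recovered bit:", guess)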