Ithemal: Accurate, Portable and Fast Basic Block Throughput Estimation using Deep Neural Networks
Predicting the number of clock cycles a processor takes to execute a block of
assembly instructions in steady state (the throughput) is important for both
compiler designers and performance engineers. Building an analytical model to
do so is especially complicated in modern x86-64 Complex Instruction Set
Computer (CISC) machines with sophisticated processor microarchitectures in
that it is tedious, error prone, and must be performed from scratch for each
processor generation. In this paper we present Ithemal, the first tool which
learns to predict the throughput of a set of instructions. Ithemal uses a
hierarchical LSTM--based approach to predict throughput based on the opcodes
and operands of instructions in a basic block. We show that Ithemal is more
accurate than state-of-the-art hand-written tools currently used in compiler
backends and static machine code analyzers. In particular, our model has less
than half the error of state-of-the-art analytical models (LLVM's llvm-mca and
Intel's IACA). Ithemal is also able to predict these throughput values just as
fast as the aforementioned tools, and is easily ported across a variety of
processor microarchitectures with minimal developer effort.
Comment: Published at 36th International Conference on Machine Learning (ICML)
201
COMPARISON OF INSTRUCTION SCHEDULING AND REGISTER ALLOCATION FOR MIPS AND HPL-PD ARCHITECTURE FOR EXPLOITATION OF INSTRUCTION LEVEL PARALLELISM
Integrated approaches to instruction scheduling and register allocation have been a promising area of research in
code generation and compiler optimization. In this paper we propose an integrated algorithm for instruction
scheduling and register allocation and implement it for compiler optimization in a machine description in the Trimaran
infrastructure, to exploit instruction-level parallelism. Our implementation in the Trimaran infrastructure shows
that our scheduler reduces the number of active live ranges the linear scan allocator has to handle. As a result, only a few spills
were needed and the quality of the generated code improved. For our experiments we used 20 benchmarks
available with the Trimaran infrastructure for the HPL-PD architecture. We compare some of these results with results
obtained by Haijing Tang et al. (2013) using the LLVM compiler on the MIPS architecture. For our experimental work
we added a machine description (MDES) targeting the HPL-PD architecture. The implemented algorithm is based on
subgraph isomorphism. The input program is represented as a directed acyclic graph (DAG). The vertices of
the DAG represent the instructions and the input and output operands of the program, while the edges represent dependencies
among the instructions.
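The DAG representation described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `(opcode, dest, sources)` tuple format and the function name `build_dependency_dag` are assumptions made for illustration, and only read-after-write dependencies are tracked.

```python
from collections import defaultdict

def build_dependency_dag(instructions):
    """Return {producer_index: [consumer_index, ...]} for read-after-write dependencies.

    Each instruction is a hypothetical (opcode, dest, sources) tuple.
    """
    last_writer = {}            # register -> index of the latest instruction writing it
    edges = defaultdict(list)   # producer index -> consumer indices
    for i, (_, dest, srcs) in enumerate(instructions):
        for reg in srcs:
            # Add an edge from the most recent writer of each source register.
            if reg in last_writer and i not in edges[last_writer[reg]]:
                edges[last_writer[reg]].append(i)
        if dest is not None:
            last_writer[dest] = i
    return dict(edges)

program = [
    ("load", "r1", ["a"]),
    ("load", "r2", ["b"]),
    ("add", "r3", ["r1", "r2"]),
    ("mul", "r4", ["r3", "r1"]),
]
dag = build_dependency_dag(program)
print(dag)
```

Here the two loads have no incoming edges, so a scheduler would be free to reorder them, while `add` and `mul` must follow their producers.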
uiCA: Accurate Throughput Prediction of Basic Blocks on Recent Intel Microarchitectures
Performance models that statically predict the steady-state throughput of basic blocks on particular microarchitectures, such as IACA,
Ithemal, llvm-mca, OSACA, or CQA, can guide optimizing compilers and aid manual software optimization. However, their utility
heavily depends on the accuracy of their predictions. The average
error of existing models compared to measurements on the actual
hardware has been shown to lie between 9% and 36%. But how
good is this? To answer this question, we propose an extremely
simple analytical throughput model that may serve as a baseline.
Surprisingly, this model is already competitive with the state of the
art, indicating that there is significant potential for improvement.
To explore this potential, we develop a simulation-based throughput predictor. To this end, we propose a detailed parametric pipeline
model that supports all Intel Core microarchitecture generations
released between 2011 and 2021. We evaluate our predictor on an
improved version of the BHive benchmark suite and show that
its predictions are usually within 1% of measurement results, improving upon prior models by roughly an order of magnitude. The
experimental evaluation also demonstrates that several microarchitectural details considered to be rather insignificant in previous
work, are in fact essential for accurate prediction.
Our throughput predictor is available as open source
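To make the idea of a simple analytical baseline and of an average prediction error concrete, here is a minimal Python sketch. It is not the model from the paper: the bottleneck formula, the function names, and the default parameters (`issue_width`, `load_ports`, `store_ports`) are assumptions chosen purely for illustration.

```python
def baseline_throughput(n_uops, n_loads, n_stores,
                        issue_width=4, load_ports=2, store_ports=1):
    """Hypothetical bottleneck bound: cycles per iteration of a basic block.

    The slowest of three assumed resources (issue slots, load ports,
    store ports) determines the estimate.
    """
    return max(n_uops / issue_width,
               n_loads / load_ports,
               n_stores / store_ports)

def mean_relative_error(predicted, measured):
    """Average of |prediction - measurement| / measurement over all blocks."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)

# A block with 8 micro-ops, 3 loads and 1 store (made-up numbers).
pred = baseline_throughput(n_uops=8, n_loads=3, n_stores=1)

# Error of two made-up predictions against two made-up measurements.
err = mean_relative_error([2.0, 1.0], [2.5, 1.0])
print(pred, err)
```

In this example the issue width is the bottleneck (8 / 4 = 2 cycles), and the mean relative error of the two made-up predictions is 10%.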
Survey on Combinatorial Register Allocation and Instruction Scheduling
Register allocation (mapping variables to processor registers or memory) and
instruction scheduling (reordering instructions to increase instruction-level
parallelism) are essential tasks for generating efficient assembly code in a
compiler. In the last three decades, combinatorial optimization has emerged as
an alternative to traditional, heuristic algorithms for these two tasks.
Combinatorial optimization approaches can deliver optimal solutions according
to a model, can precisely capture trade-offs between conflicting decisions, and
are more flexible at the expense of increased compilation time.
This paper provides an exhaustive literature review and a classification of
combinatorial optimization approaches to register allocation and instruction
scheduling, with a focus on the techniques that are most applied in this
context: integer programming, constraint programming, partitioned Boolean
quadratic programming, and enumeration. Researchers in compilers and
combinatorial optimization can benefit from identifying developments, trends,
and challenges in the area; compiler practitioners may discern opportunities
and grasp the potential benefit of applying combinatorial optimization
Trace-based Register Allocation in a JIT Compiler
State-of-the-art dynamic compilers often use global approaches, like Linear Scan or Graph Coloring, for register allocation. These algorithms consider the complete compilation unit for allocation, which increases the complexity of the implementation (e.g., support for lifetime holes in Linear Scan) and potentially also affects compilation time. We propose a novel non-global algorithm, which splits a compilation unit into traces based on profiling feedback and subsequently performs register allocation within each trace individually. Traces reduce the problem size to a single linear code segment, which simplifies the problem a register allocator needs to solve. Additionally, we can apply different register allocation algorithms to each trace. We show that this non-global approach can achieve results competitive to global register allocation.
We present an implementation of Trace Register Allocation based on the Graal VM and show an evaluation for common Java benchmarks. We demonstrate that performance of this non-global approach is within 3% (on AMD64) and 1% (on SPARC) of global Linear Scan register allocation.
Processor Models For Instruction Scheduling using Constraint Programming
Instruction scheduling is one of the most important optimisations performed when producing code in a compiler. The problem consists of finding a minimum-length schedule subject to latency and resource constraints. This is a hard problem, classically approached by heuristic algorithms. In the last decade, research interest has shifted from heuristic to potentially optimal methods. When using optimal methods, a lot of compilation time is spent searching for an optimal solution. This makes it important that the problem definition reflects the reality of the processor. In this work, a constraint programming approach was used to study the impact that model detail has on performance. Several models of a superscalar processor were embedded in LLVM and evaluated using SPEC CPU2000. The results show that there is substantial performance to be gained, over 5% for some programs. The stability of the improvement is heavily dependent on the accuracy of the model.
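The scheduling problem stated in this abstract (a short schedule under latency and resource constraints) can be illustrated with a greedy list scheduler. This is the classical heuristic style the abstract contrasts with, not the constraint-programming method of the work. The function name, the input format, and the single `issue_width` resource are assumptions; latencies are assumed to be at least 1 and the dependency graph acyclic.

```python
def list_schedule(latency, deps, issue_width):
    """Greedily assign each instruction an issue cycle.

    latency: {instr: cycles until its result is available} (assumed >= 1)
    deps: {instr: [instructions it depends on]} (assumed acyclic)
    issue_width: maximum instructions issued per cycle (the only resource modelled)
    """
    start = {}
    remaining = list(latency)   # program order serves as the priority
    cycle = 0
    while remaining:
        issued = 0
        for instr in list(remaining):
            if issued == issue_width:
                break
            preds = deps.get(instr, [])
            # Ready once every predecessor has issued and its latency has elapsed.
            ready = all(p in start and start[p] + latency[p] <= cycle for p in preds)
            if ready:
                start[instr] = cycle
                remaining.remove(instr)
                issued += 1
        cycle += 1
    return start

latency = {"a": 2, "b": 1, "c": 1, "d": 1}
deps = {"c": ["a", "b"], "d": ["c"]}
schedule = list_schedule(latency, deps, issue_width=2)
length = max(schedule[i] + latency[i] for i in schedule)
print(schedule, length)
```

The greedy choice gives no optimality guarantee in general, which is the gap that the potentially optimal methods mentioned in the abstract aim to close.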
Survey on Instruction Selection: An Extensive and Modern Literature Review
Instruction selection is one of three optimisation problems involved in the
code generator backend of a compiler. The instruction selector is responsible
for transforming an input program from its target-independent representation
into a target-specific form by making best use of the available machine
instructions. Hence instruction selection is a crucial part of efficient code
generation.
Despite ongoing research since the late 1960s, the last comprehensive
survey on the field was written more than 30 years ago. As new approaches and
techniques have appeared since its publication, this brings forth a need for a
new, up-to-date review of the current body of literature. This report addresses
that need by performing an extensive review and categorisation of existing
research. The report therefore supersedes and extends the previous surveys, and
also attempts to identify where future research should be directed.
Comment: Major changes: - Merged simulation chapter with macro expansion
chapter - Addressed misunderstandings of several approaches - Completely
rewrote many parts of the chapters; strengthened the discussion of many
approaches - Revised the drawing of all trees and graphs to put the root at
the top instead of at the bottom - Added appendix for listing the approaches
in a table See doc for more inf