Future Directions for Optimizing Compilers
As software becomes larger, programming languages become higher-level, and
processors continue to fail to be clocked faster, we'll increasingly require
compilers to reduce code bloat, eliminate abstraction penalties, and exploit
interesting instruction sets. At the same time, compile times must not grow
excessively, and compilers must never produce incorrect output.
This paper examines the problem of making optimizing compilers faster, less
buggy, and more capable of generating high-quality output.
Souper: A Synthesizing Superoptimizer
If we can automatically derive compiler optimizations, we might be able to
sidestep some of the substantial engineering challenges involved in creating
and maintaining a high-quality compiler. We developed Souper, a synthesizing
superoptimizer, to see how far these ideas might be pushed in the context of
LLVM. Along the way, we discovered that Souper's intermediate representation
was sufficiently similar to the one in Microsoft Visual C++ that we applied
Souper to that compiler as well. Shipping, or about-to-ship, versions of both
compilers contain optimizations suggested by Souper but implemented by hand.
Alternatively, when Souper is used as a fully automated optimization pass it
compiles a Clang compiler binary that is about 3 MB (4.4%) smaller than the
one compiled by LLVM.
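The core idea of a synthesizing superoptimizer, searching a space of candidate instruction sequences for a cheaper one that matches the original computation, can be sketched in a few lines. The toy sketch below is illustrative only and is not Souper's algorithm or API: it checks candidates on sampled inputs, whereas Souper proves equivalence with an SMT solver.

```python
import itertools

# Toy superoptimizer: search a tiny expression grammar for the
# shortest candidate matching a target function on 8-bit inputs.
WIDTH = 8
MASK = (1 << WIDTH) - 1
INPUTS = range(0, 256, 17)  # sampled test inputs (unsound without a prover)

def candidates(depth):
    """Enumerate (text, function) candidates built from x, constants, and ops."""
    if depth == 0:
        yield ("x", lambda x: x)
        for c in (0, 1, MASK):
            yield (str(c), lambda x, c=c: c)
        return
    for (sa, fa), (sb, fb) in itertools.product(candidates(depth - 1), repeat=2):
        yield (f"({sa} & {sb})", lambda x, fa=fa, fb=fb: fa(x) & fb(x))
        yield (f"({sa} ^ {sb})", lambda x, fa=fa, fb=fb: fa(x) ^ fb(x))
        yield (f"({sa} + {sb})", lambda x, fa=fa, fb=fb: (fa(x) + fb(x)) & MASK)

def superoptimize(target, max_depth=1):
    """Return the first (i.e. shortest-depth) candidate agreeing with target."""
    for depth in range(max_depth + 1):
        for expr, f in candidates(depth):
            if all(f(x) == target(x) for x in INPUTS):
                return expr
    return None

# x ^ x is always 0, so the search finds the depth-0 constant.
print(superoptimize(lambda x: x ^ x))  # prints 0
```

A real superoptimizer replaces the sampled-input check with a verified equivalence query, which is where most of the engineering effort lies.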
Improved Basic Block Reordering
Basic block reordering is an important step for profile-guided binary
optimization. The state-of-the-art goal for basic block reordering is to
maximize the number of fall-through branches. However, we demonstrate that such
orderings may impose suboptimal performance on instruction and I-TLB caches. We
propose a new algorithm that relies on a model combining the effects of
fall-through and caching behavior. Because the details of modern processor
caching are quite complex and often unknown, we show how to use machine
learning to select parameters that best trade off different caching effects
to maximize binary performance.
An extensive evaluation on a variety of applications, including Facebook
production workloads, the open-source compilers Clang and GCC, and SPEC CPU
benchmarks, indicates that the new method outperforms existing block reordering
techniques, improving the resulting performance of applications with large code
size. We have open sourced the code of the new algorithm as a part of a
post-link binary optimization tool, BOLT.
Comment: Published in IEEE Transactions on Computer
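The fall-through-maximization baseline that this work improves on can be sketched as a greedy chain-merging heuristic in the spirit of Pettis-Hansen: repeatedly glue together the block chains connected by the hottest branch edge, so that hot branches become fall-throughs. This is an illustrative sketch, not the paper's cache-aware algorithm.

```python
# Greedy basic-block chaining: merge chains along the hottest profile
# edges so that hot branches fall through in the final layout.
def reorder_blocks(blocks, edges):
    """blocks: list of block ids; edges: {(src, dst): weight} branch profile."""
    chains = {b: [b] for b in blocks}   # each block starts as its own chain
    head = {b: b for b in blocks}       # chain head containing each block
    tail = {b: b for b in blocks}       # last block of each chain (by head)
    for (src, dst), _w in sorted(edges.items(), key=lambda e: -e[1]):
        # Merge only if src ends one chain, dst heads another, and they differ.
        if tail[head[src]] == src and head[dst] == dst and head[src] != dst:
            a, b = head[src], dst
            chains[a].extend(chains[b])
            for blk in chains[b]:
                head[blk] = a
            tail[a] = tail[b]
            del chains[b]
    # Concatenate the remaining chains into one layout.
    order = []
    for chain in chains.values():
        order.extend(chain)
    return order

profile = {("A", "B"): 100, ("B", "C"): 90, ("A", "D"): 10}
print(reorder_blocks(["A", "B", "C", "D"], profile))  # -> ['A', 'B', 'C', 'D']
```

The paper's contribution is precisely that maximizing fall-throughs this way can hurt I-cache and I-TLB behavior, which its model accounts for.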
Conformal Computing: Algebraically connecting the hardware/software boundary using a uniform approach to high-performance computation for software and hardware applications
We present a systematic, algebraically based, design methodology for
efficient implementation of computer programs optimized over multiple levels of
the processor/memory and network hierarchy. Using a common formalism to
describe the problem and the partitioning of data over processors and memory
levels allows one to mathematically prove the efficiency and correctness of a
given algorithm as measured in terms of a set of metrics (such as
processor/network speeds, etc.). The approach allows the average programmer to
achieve high-level optimizations similar to those used by compiler writers
(e.g. the notion of "tiling").
The approach presented in this monograph makes use of A Mathematics of Arrays
(MoA, Mullin 1988) and an indexing calculus (i.e. the psi-calculus) to enable
the programmer to develop algorithms using high-level compiler-like
optimizations through the ability to algebraically compose and reduce sequences
of array operations. Extensive discussion and benchmark results are presented
for the Fast Fourier Transform and other important algorithms
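As a concrete instance of the "tiling" optimization mentioned above, the sketch below shows the classic cache-blocking rewrite of matrix multiplication. It illustrates the transformation itself, not MoA or the psi-calculus, which express such rewrites algebraically.

```python
# Loop tiling: both versions compute the same product, but the tiled
# version visits B in small blocks so each tile can stay cache-resident.
def matmul_naive(A, B, n):
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t=2):
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for k in range(kk, min(kk + t, n)):
                        a = A[i][k]  # hoisted: reused across the j tile
                        for j in range(jj, min(jj + t, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]
assert matmul_naive(A, B, n) == matmul_tiled(A, B, n)
```

The point of an algebraic framework like MoA is that such reorderings can be derived and proved correct by composing and reducing array operations, rather than applied ad hoc.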
Improving incremental signature-based Groebner basis algorithms
In this paper we describe a combination of ideas that improve incremental
signature-based Groebner basis algorithms and have a big impact on their
performance. Besides explaining how to combine already known optimizations to
achieve more efficient algorithms, we compare the quite different effects on
the two best-known algorithms in this area, F5 and G2V, both from a theoretical
and a practical point of view.
Comment: 19 pages, 3 table
EmptyHeaded: A Relational Engine for Graph Processing
There are two types of high-performance graph processing engines: low- and
high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide
optimized data structures and computation models but require users to write
low-level imperative code, hence ensuring that efficiency is the burden of the
user. In high-level engines, users write in query languages like datalog
(SociaLite) or SQL (Grail). High-level engines are easier to use but are orders
of magnitude slower than the low-level graph engines. We present EmptyHeaded, a
high-level engine that supports a rich datalog-like query language and achieves
performance comparable to that of low-level engines. At the core of
EmptyHeaded's design is a new class of join algorithms that satisfy strong
theoretical guarantees but have thus far not achieved performance comparable to
that of specialized graph processing engines. To achieve high performance,
EmptyHeaded introduces a new join engine architecture, including a novel query
optimizer and data layouts that leverage single-instruction multiple data
(SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level
approaches by up to three orders of magnitude on graph pattern queries,
PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude
faster than many low-level baselines. We validate that EmptyHeaded competes
with the best-of-breed low-level engine (Galois), achieving comparable
performance on PageRank and at most 3x worse performance on SSSP
LevelHeaded: Making Worst-Case Optimal Joins Work in the Common Case
Pipelines combining SQL-style business intelligence (BI) queries and linear
algebra (LA) are becoming increasingly common in industry. As a result, there
is a growing need to unify these workloads in a single framework.
Unfortunately, existing solutions either sacrifice the inherent benefits of
exclusively using a relational database (e.g. logical and physical
independence) or incur orders of magnitude performance gaps compared to
specialized engines (or both). In this work we study applying a new type of
query processing architecture to standard BI and LA benchmarks. To do this we
present a new in-memory query processing engine called LevelHeaded. LevelHeaded
uses worst-case optimal joins as its core execution mechanism for both BI and
LA queries. With LevelHeaded, we show how crucial optimizations for BI and LA
queries can be captured in a worst-case optimal query architecture. Using these
optimizations, LevelHeaded outperforms other relational database engines
(LogicBlox, MonetDB, and HyPer) by orders of magnitude on standard LA
benchmarks, while performing on average within 31% of the best-of-breed BI
(HyPer) and LA (Intel MKL) solutions on their own benchmarks. Our results show
that such a single query processing architecture is capable of delivering
competitive performance on both BI and LA queries.
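To see why one execution mechanism can serve both workloads, note that a linear-algebra kernel such as sparse matrix multiply is itself a join plus an aggregate. The sketch below makes that correspondence explicit; it is an illustration of the idea, not LevelHeaded's implementation.

```python
from collections import defaultdict

# Sparse matrix multiply written as a relational aggregate join:
# C(i,j) = SUM_k A(i,k) * B(k,j), i.e. a join on k plus a GROUP BY (i,j).
def matmul_as_join(A, B):
    """A, B: sparse matrices represented as {(row, col): value} relations."""
    by_row = defaultdict(list)
    for (k, j), v in B.items():
        by_row[k].append((j, v))
    C = defaultdict(float)
    for (i, k), a in A.items():       # join A(i,k) with B(k,j) on k
        for j, b in by_row[k]:
            C[(i, j)] += a * b        # SUM aggregate, grouped by (i, j)
    return dict(C)

A = {(0, 0): 2.0, (0, 1): 1.0}
B = {(0, 0): 3.0, (1, 0): 4.0}
print(matmul_as_join(A, B))  # -> {(0, 0): 10.0}
```

Once LA kernels are phrased this way, the same join optimizer and execution engine can be applied to both BI and LA queries, which is the architectural bet described above.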
Sparse Persistent RNNs: Squeezing Large Recurrent Networks On-Chip
Recurrent Neural Networks (RNNs) are powerful tools for solving
sequence-based problems, but their efficacy and execution time are dependent on
the size of the network. Following recent work in simplifying these networks
with model pruning and a novel mapping of work onto GPUs, we design an
efficient implementation for sparse RNNs. We investigate several optimizations
and tradeoffs: Lamport timestamps, wide memory loads, and a bank-aware weight
layout. With these optimizations, we achieve speedups of over 6x over the next
best algorithm for a hidden layer of size 2304, batch size of 4, and a density
of 30%. Further, our technique allows for models of over 5x the size to fit on
a GPU for a speedup of 2x, enabling larger networks to help advance the
state-of-the-art. We perform case studies on NMT and speech recognition tasks
in the appendix, accelerating their recurrent layers by up to 3x.
Comment: Published as a conference paper at ICLR 201
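The kernel at the heart of a sparse RNN layer is a sparse matrix-vector product applied at every timestep. Below is a plain-Python sketch of that kernel in compressed sparse row (CSR) form; the paper's actual contribution, the GPU mapping with bank-aware layouts and wide loads, is not modeled here.

```python
# CSR sparse matrix-vector product: y = M @ x, where M stores only its
# nonzeros (values), their column indices, and per-row offsets.
def csr_matvec(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            s += values[i] * x[col_idx[i]]
        y.append(s)
    return y

# M = [[1, 0, 2],
#      [0, 3, 0]]
values, col_idx, row_ptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # -> [3.0, 3.0]
```

After pruning, only the surviving weights are stored and multiplied, which is why a 30%-dense network can both run faster and fit a much larger model on one GPU.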
Relay: A High-Level Compiler for Deep Learning
Frameworks for writing, compiling, and optimizing deep learning (DL) models
have recently enabled progress in areas like computer vision and natural
language processing. Extending these frameworks to accommodate the rapidly
diversifying landscape of DL models and hardware platforms presents challenging
tradeoffs between expressivity, composability, and portability. We present
Relay, a new compiler framework for DL. Relay's functional, statically typed
intermediate representation (IR) unifies and generalizes existing DL IRs to
express state-of-the-art models. The introduction of Relay's expressive IR
requires careful design of domain-specific optimizations, addressed via Relay's
extension mechanisms. Using these extension mechanisms, Relay supports a
unified compiler that can target a variety of hardware platforms. Our
evaluation demonstrates Relay's competitive performance for a broad class of
models and devices (CPUs, GPUs, and emerging accelerators). Relay's design
demonstrates how a unified IR can provide expressivity, composability, and
portability without compromising performance.
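The pass-over-a-functional-IR style that such a compiler framework uses can be sketched with a toy immutable expression IR and a constant-folding pass. All names below are illustrative; this is not Relay's actual IR or API.

```python
from dataclasses import dataclass

# A toy functional IR: expressions are immutable trees, and an
# optimization is a pass that rewrites one tree into another.
@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

def fold_constants(expr):
    """One pass: recursively evaluate additions of known constants."""
    if isinstance(expr, Add):
        lhs, rhs = fold_constants(expr.lhs), fold_constants(expr.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(lhs.value + rhs.value)
        return Add(lhs, rhs)
    return expr

prog = Add(Add(Const(1.0), Const(2.0)), Var("x"))
print(fold_constants(prog))  # -> Add(lhs=Const(value=3.0), rhs=Var(name='x'))
```

Because passes consume and produce whole expressions, new domain-specific rewrites compose with existing ones, which is the extensibility property the abstract emphasizes.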
Optimized Polynomial Evaluation with Semantic Annotations
In this paper we discuss how semantic annotations can be used to expose the
mathematical, algorithmic structure of the underlying imperative code, enabling
compilers to apply code transformations that yield better performance. This
approach achieves not only good performance but also better programmability,
maintainability, and portability across different hardware architectures. To
exemplify this we use polynomial equations of different degrees.
Comment: Part of the Program Transformation for Programmability in
Heterogeneous Architectures (PROHA) workshop, Barcelona, Spain, 12th March
2016, 7 pages, LaTeX, 4 PNG figure
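A standard example of the kind of rewrite such annotations can license is Horner's rule for polynomial evaluation, contrasted below with naive term-by-term evaluation. This is a generic illustration of the transformation, not the paper's implementation.

```python
# Horner's rule: rewrite c0 + c1*x + c2*x^2 + ... as
# (...(cn*x + c(n-1))*x + ...)*x + c0, replacing powers with
# one multiply-add per coefficient.
def eval_naive(coeffs, x):
    """coeffs[i] is the coefficient of x**i."""
    return sum(c * x**i for i, c in enumerate(coeffs))

def eval_horner(coeffs, x):
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# p(x) = 2 + 3x + x^2 at x = 2 gives 12 either way.
coeffs = [2.0, 3.0, 1.0]
assert eval_naive(coeffs, 2.0) == eval_horner(coeffs, 2.0) == 12.0
```

A compiler cannot generally prove that such an algebraic rewrite is intended for floating-point code, since it changes rounding; a semantic annotation is what tells it the rewrite is permitted.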