Future Directions for Optimizing Compilers
As software becomes larger, programming languages become higher-level, and
processors continue to fail to be clocked faster, we'll increasingly require
compilers to reduce code bloat, eliminate abstraction penalties, and exploit
interesting instruction sets. At the same time, compile times must not grow
excessively, and compilers must never produce incorrect output.
This paper examines the problem of making optimizing compilers faster, less
buggy, and more capable of generating high-quality output.
Souper: A Synthesizing Superoptimizer
If we can automatically derive compiler optimizations, we might be able to
sidestep some of the substantial engineering challenges involved in creating
and maintaining a high-quality compiler. We developed Souper, a synthesizing
superoptimizer, to see how far these ideas might be pushed in the context of
LLVM. Along the way, we discovered that Souper's intermediate representation
was sufficiently similar to the one in Microsoft Visual C++ that we applied
Souper to that compiler as well. Shipping, or about-to-ship, versions of both
compilers contain optimizations suggested by Souper but implemented by hand.
Alternatively, when Souper is used as a fully automated optimization pass it
compiles a Clang compiler binary that is about 3 MB (4.4%) smaller than the
one compiled by LLVM.
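The core idea of a synthesizing superoptimizer, searching a space of candidate instruction sequences for a cheaper one that matches the original computation, can be sketched in a few lines. The toy sketch below is illustrative only and is not Souper's algorithm or API: it checks candidates on sampled inputs, whereas Souper proves equivalence with an SMT solver.

```python
import itertools

# Toy superoptimizer: search a tiny expression grammar for the
# shortest candidate matching a target function on 8-bit inputs.
WIDTH = 8
MASK = (1 << WIDTH) - 1
INPUTS = range(0, 256, 17)  # sampled test inputs (unsound without a prover)

def candidates(depth):
    """Enumerate (text, function) candidates built from x, constants, and ops."""
    if depth == 0:
        yield ("x", lambda x: x)
        for c in (0, 1, MASK):
            yield (str(c), lambda x, c=c: c)
        return
    for (sa, fa), (sb, fb) in itertools.product(candidates(depth - 1), repeat=2):
        yield (f"({sa} & {sb})", lambda x, fa=fa, fb=fb: fa(x) & fb(x))
        yield (f"({sa} ^ {sb})", lambda x, fa=fa, fb=fb: fa(x) ^ fb(x))
        yield (f"({sa} + {sb})", lambda x, fa=fa, fb=fb: (fa(x) + fb(x)) & MASK)

def superoptimize(target, max_depth=1):
    """Return the first (i.e. shortest-depth) candidate agreeing with target."""
    for depth in range(max_depth + 1):
        for expr, f in candidates(depth):
            if all(f(x) == target(x) for x in INPUTS):
                return expr
    return None

# x ^ x is always 0, so the search finds the depth-0 constant.
print(superoptimize(lambda x: x ^ x))  # prints 0
```

A real superoptimizer replaces the sampled-input check with a verified equivalence query, which is where most of the engineering effort lies.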
Improved Basic Block Reordering
Basic block reordering is an important step for profile-guided binary
optimization. The state-of-the-art goal for basic block reordering is to
maximize the number of fall-through branches. However, we demonstrate that such
orderings may impose suboptimal performance on instruction and I-TLB caches. We
propose a new algorithm that relies on a model combining the effects of
fall-through and caching behavior. Because the details of modern processor
caching are quite complex and often unknown, we show how to use machine
learning to select parameters that best trade off different caching effects
to maximize binary performance.
An extensive evaluation on a variety of applications, including Facebook
production workloads, the open-source compilers Clang and GCC, and SPEC CPU
benchmarks, indicates that the new method outperforms existing block reordering
techniques, improving the resulting performance of applications with large code
size. We have open sourced the code of the new algorithm as a part of a
post-link binary optimization tool, BOLT.
Comment: Published in IEEE Transactions on Computer
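The fall-through-maximization baseline that this work improves on can be sketched as a greedy chain-merging heuristic in the spirit of Pettis-Hansen: repeatedly glue together the block chains connected by the hottest branch edge, so that hot branches become fall-throughs. This is an illustrative sketch, not the paper's cache-aware algorithm.

```python
# Greedy basic-block chaining: merge chains along the hottest profile
# edges so that hot branches fall through in the final layout.
def reorder_blocks(blocks, edges):
    """blocks: list of block ids; edges: {(src, dst): weight} branch profile."""
    chains = {b: [b] for b in blocks}   # each block starts as its own chain
    head = {b: b for b in blocks}       # chain head containing each block
    tail = {b: b for b in blocks}       # last block of each chain (by head)
    for (src, dst), _w in sorted(edges.items(), key=lambda e: -e[1]):
        # Merge only if src ends one chain, dst heads another, and they differ.
        if tail[head[src]] == src and head[dst] == dst and head[src] != dst:
            a, b = head[src], dst
            chains[a].extend(chains[b])
            for blk in chains[b]:
                head[blk] = a
            tail[a] = tail[b]
            del chains[b]
    # Concatenate the remaining chains into one layout.
    order = []
    for chain in chains.values():
        order.extend(chain)
    return order

profile = {("A", "B"): 100, ("B", "C"): 90, ("A", "D"): 10}
print(reorder_blocks(["A", "B", "C", "D"], profile))  # -> ['A', 'B', 'C', 'D']
```

The paper's contribution is precisely that maximizing fall-throughs this way can hurt I-cache and I-TLB behavior, which its model accounts for.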
Conformal Computing: Algebraically connecting the hardware/software boundary using a uniform approach to high-performance computation for software and hardware applications
We present a systematic, algebraically based, design methodology for
efficient implementation of computer programs optimized over multiple levels of
the processor/memory and network hierarchy. Using a common formalism to
describe the problem and the partitioning of data over processors and memory
levels allows one to mathematically prove the efficiency and correctness of a
given algorithm as measured in terms of a set of metrics (such as
processor/network speeds, etc.). The approach allows the average programmer to
achieve high-level optimizations similar to those used by compiler writers
(e.g. the notion of "tiling").
The approach presented in this monograph makes use of A Mathematics of Arrays
(MoA, Mullin 1988) and an indexing calculus (i.e. the psi-calculus) to enable
the programmer to develop algorithms using high-level compiler-like
optimizations through the ability to algebraically compose and reduce sequences
of array operations. Extensive discussion and benchmark results are presented
for the Fast Fourier Transform and other important algorithms
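As a concrete instance of the "tiling" optimization mentioned above, the sketch below shows the classic cache-blocking rewrite of matrix multiplication. It illustrates the transformation itself, not MoA or the psi-calculus, which express such rewrites algebraically.

```python
# Loop tiling: both versions compute the same product, but the tiled
# version visits B in small blocks so each tile can stay cache-resident.
def matmul_naive(A, B, n):
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_tiled(A, B, n, t=2):
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, t):
        for kk in range(0, n, t):
            for jj in range(0, n, t):
                for i in range(ii, min(ii + t, n)):
                    for k in range(kk, min(kk + t, n)):
                        a = A[i][k]  # hoisted: reused across the j tile
                        for j in range(jj, min(jj + t, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j for j in range(n)] for i in range(n)]
assert matmul_naive(A, B, n) == matmul_tiled(A, B, n)
```

The point of an algebraic framework like MoA is that such reorderings can be derived and proved correct by composing and reducing array operations, rather than applied ad hoc.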
Improving incremental signature-based Groebner basis algorithms
In this paper we describe a combination of ideas that improve incremental
signature-based Groebner basis algorithms and have a big impact on their
performance. Besides explaining how to combine already known optimizations to
achieve more efficient algorithms, we compare the quite different effects on
the two best-known algorithms in this area, F5 and G2V, both from a theoretical
and a practical point of view.
Comment: 19 pages, 3 table
EmptyHeaded: A Relational Engine for Graph Processing
There are two types of high-performance graph processing engines: low- and
high-level engines. Low-level engines (Galois, PowerGraph, Snap) provide
optimized data structures and computation models but require users to write
low-level imperative code, hence ensuring that efficiency is the burden of the
user. In high-level engines, users write in query languages like datalog
(SociaLite) or SQL (Grail). High-level engines are easier to use but are orders
of magnitude slower than the low-level graph engines. We present EmptyHeaded, a
high-level engine that supports a rich datalog-like query language and achieves
performance comparable to that of low-level engines. At the core of
EmptyHeaded's design is a new class of join algorithms that satisfy strong
theoretical guarantees but have thus far not achieved performance comparable to
that of specialized graph processing engines. To achieve high performance,
EmptyHeaded introduces a new join engine architecture, including a novel query
optimizer and data layouts that leverage single-instruction multiple data
(SIMD) parallelism. With this architecture, EmptyHeaded outperforms high-level
approaches by up to three orders of magnitude on graph pattern queries,
PageRank, and Single-Source Shortest Paths (SSSP) and is an order of magnitude
faster than many low-level baselines. We validate that EmptyHeaded competes
with the best-of-breed low-level engine (Galois), achieving comparable
performance on PageRank and at most 3x worse performance on SSSP
LevelHeaded: Making Worst-Case Optimal Joins Work in the Common Case
Pipelines combining SQL-style business intelligence (BI) queries and linear
algebra (LA) are becoming increasingly common in industry. As a result, there
is a growing need to unify these workloads in a single framework.
Unfortunately, existing solutions either sacrifice the inherent benefits of
exclusively using a relational database (e.g. logical and physical
independence) or incur orders of magnitude performance gaps compared to
specialized engines (or both). In this work we study applying a new type of
query processing architecture to standard BI and LA benchmarks. To do this we
present a new in-memory query processing engine called LevelHeaded. LevelHeaded
uses worst-case optimal joins as its core execution mechanism for both BI and
LA queries. With LevelHeaded, we show how crucial optimizations for BI and LA
queries can be captured in a worst-case optimal query architecture. Using these
optimizations, LevelHeaded outperforms other relational database engines
(LogicBlox, MonetDB, and HyPer) by orders of magnitude on standard LA
benchmarks, while performing on average within 31% of the best-of-breed BI
(HyPer) and LA (Intel MKL) solutions on their own benchmarks. Our results show
that such a single query processing architecture is capable of delivering
competitive performance on both BI and LA queries.
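To see why one execution mechanism can serve both workloads, note that a linear-algebra kernel such as sparse matrix multiply is itself a join plus an aggregate. The sketch below makes that correspondence explicit; it is an illustration of the idea, not LevelHeaded's implementation.

```python
from collections import defaultdict

# Sparse matrix multiply written as a relational aggregate join:
# C(i,j) = SUM_k A(i,k) * B(k,j), i.e. a join on k plus a GROUP BY (i,j).
def matmul_as_join(A, B):
    """A, B: sparse matrices represented as {(row, col): value} relations."""
    by_row = defaultdict(list)
    for (k, j), v in B.items():
        by_row[k].append((j, v))
    C = defaultdict(float)
    for (i, k), a in A.items():       # join A(i,k) with B(k,j) on k
        for j, b in by_row[k]:
            C[(i, j)] += a * b        # SUM aggregate, grouped by (i, j)
    return dict(C)

A = {(0, 0): 2.0, (0, 1): 1.0}
B = {(0, 0): 3.0, (1, 0): 4.0}
print(matmul_as_join(A, B))  # -> {(0, 0): 10.0}
```

Once LA kernels are phrased this way, the same join optimizer and execution engine can be applied to both BI and LA queries, which is the architectural bet described above.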
Sparse Persistent RNNs: Squeezing Large Recurrent Networks On-Chip
Recurrent Neural Networks (RNNs) are powerful tools for solving
sequence-based problems, but their efficacy and execution time are dependent on
the size of the network. Following recent work in simplifying these networks
with model pruning and a novel mapping of work onto GPUs, we design an
efficient implementation for sparse RNNs. We investigate several optimizations
and tradeoffs: Lamport timestamps, wide memory loads, and a bank-aware weight
layout. With these optimizations, we achieve speedups of over 6x over the next
best algorithm for a hidden layer of size 2304, batch size of 4, and a density
of 30%. Further, our technique allows for models of over 5x the size to fit on
a GPU for a speedup of 2x, enabling larger networks to help advance the
state-of-the-art. We perform case studies on NMT and speech recognition tasks
in the appendix, accelerating their recurrent layers by up to 3x.
Comment: Published as a conference paper at ICLR 201
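The kernel at the heart of a sparse RNN layer is a sparse matrix-vector product applied at every timestep. Below is a plain-Python sketch of that kernel in compressed sparse row (CSR) form; the paper's actual contribution, the GPU mapping with bank-aware layouts and wide loads, is not modeled here.

```python
# CSR sparse matrix-vector product: y = M @ x, where M stores only its
# nonzeros (values), their column indices, and per-row offsets.
def csr_matvec(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        s = 0.0
        for i in range(row_ptr[r], row_ptr[r + 1]):
            s += values[i] * x[col_idx[i]]
        y.append(s)
    return y

# M = [[1, 0, 2],
#      [0, 3, 0]]
values, col_idx, row_ptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
print(csr_matvec(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # -> [3.0, 3.0]
```

After pruning, only the surviving weights are stored and multiplied, which is why a 30%-dense network can both run faster and fit a much larger model on one GPU.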
Relay: A High-Level Compiler for Deep Learning
Frameworks for writing, compiling, and optimizing deep learning (DL) models
have recently enabled progress in areas like computer vision and natural
language processing. Extending these frameworks to accommodate the rapidly
diversifying landscape of DL models and hardware platforms presents challenging
tradeoffs between expressivity, composability, and portability. We present
Relay, a new compiler framework for DL. Relay's functional, statically typed
intermediate representation (IR) unifies and generalizes existing DL IRs to
express state-of-the-art models. The introduction of Relay's expressive IR
requires careful design of domain-specific optimizations, addressed via Relay's
extension mechanisms. Using these extension mechanisms, Relay supports a
unified compiler that can target a variety of hardware platforms. Our
evaluation demonstrates Relay's competitive performance for a broad class of
models and devices (CPUs, GPUs, and emerging accelerators). Relay's design
demonstrates how a unified IR can provide expressivity, composability, and
portability without compromising performance.
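The pass-over-a-functional-IR style that such a compiler framework uses can be sketched with a toy immutable expression IR and a constant-folding pass. All names below are illustrative; this is not Relay's actual IR or API.

```python
from dataclasses import dataclass

# A toy functional IR: expressions are immutable trees, and an
# optimization is a pass that rewrites one tree into another.
@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    lhs: object
    rhs: object

def fold_constants(expr):
    """One pass: recursively evaluate additions of known constants."""
    if isinstance(expr, Add):
        lhs, rhs = fold_constants(expr.lhs), fold_constants(expr.rhs)
        if isinstance(lhs, Const) and isinstance(rhs, Const):
            return Const(lhs.value + rhs.value)
        return Add(lhs, rhs)
    return expr

prog = Add(Add(Const(1.0), Const(2.0)), Var("x"))
print(fold_constants(prog))  # -> Add(lhs=Const(value=3.0), rhs=Var(name='x'))
```

Because passes consume and produce whole expressions, new domain-specific rewrites compose with existing ones, which is the extensibility property the abstract emphasizes.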
Optimized Polynomial Evaluation with Semantic Annotations
In this paper we discuss how semantic annotations can be used to expose the
mathematical, algorithmic structure of the underlying imperative code, enabling
compilers to apply code transformations that yield better performance. This
approach achieves not only good performance but also better programmability,
maintainability, and portability across different hardware architectures. To
exemplify this we use polynomial equations of different degrees.
Comment: Part of the Program Transformation for Programmability in
Heterogeneous Architectures (PROHA) workshop, Barcelona, Spain, 12th March
2016, 7 pages, LaTeX, 4 PNG figure
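A standard example of the kind of rewrite such annotations can license is Horner's rule for polynomial evaluation, contrasted below with naive term-by-term evaluation. This is a generic illustration of the transformation, not the paper's implementation.

```python
# Horner's rule: rewrite c0 + c1*x + c2*x^2 + ... as
# (...(cn*x + c(n-1))*x + ...)*x + c0, replacing powers with
# one multiply-add per coefficient.
def eval_naive(coeffs, x):
    """coeffs[i] is the coefficient of x**i."""
    return sum(c * x**i for i, c in enumerate(coeffs))

def eval_horner(coeffs, x):
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc

# p(x) = 2 + 3x + x^2 at x = 2 gives 12 either way.
coeffs = [2.0, 3.0, 1.0]
assert eval_naive(coeffs, 2.0) == eval_horner(coeffs, 2.0) == 12.0
```

A compiler cannot generally prove that such an algebraic rewrite is intended for floating-point code, since it changes rounding; a semantic annotation is what tells it the rewrite is permitted.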