
    A Survey on Compiler Autotuning using Machine Learning

    Since the mid-1990s, researchers have been trying to use machine-learning-based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order in which to apply them). The compiler optimization space continues to grow due to the advancement of applications, the increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for compiler optimization, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the results obtained, a fine-grained classification of the different approaches and, finally, the influential papers of the field.
    Comment: version 5.0 (updated September 2018); preprint of the journal article accepted at ACM CSUR 2018 (42 pages). This survey will be updated quarterly here (send me your newly published papers to be added in subsequent versions). History: Received November 2016; Revised August 2017; Revised February 2018; Accepted March 2018.
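    The two problems named above differ in the shape of their search spaces, which is what makes exhaustive approaches infeasible. A minimal sketch of the distinction (the flag names are illustrative placeholders, not any particular compiler's options):

```python
from itertools import combinations, permutations

FLAGS = ["inline", "unroll", "vectorize"]  # illustrative optimization names

# Optimization selection: choose WHICH optimizations to apply.
# The space is the 2^n subsets of the available flags.
selections = [set(c) for k in range(len(FLAGS) + 1)
              for c in combinations(FLAGS, k)]

# Phase-ordering: choose the ORDER in which to apply them.
# Even without repetition this adds n! orderings per subset,
# which is why ML-guided search is attractive.
orderings = list(permutations(FLAGS))

print(len(selections), "selections;", len(orderings), "orderings of all three")
```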

    SiblingRivalry: Online Autotuning Through Local Competitions

    Modern high performance libraries, such as ATLAS and FFTW, and programming languages, such as PetaBricks, have shown that autotuning computer programs can lead to significant speedups. However, autotuning can be burdensome to the deployment of a program, since the tuning process can take a long time and should be re-run whenever the program, microarchitecture, execution environment, or toolchain changes. Failure to re-autotune programs often leads to widespread use of sub-optimal algorithms. With the growth of cloud computing, where computations can run in environments with unknown load and migrate between different (possibly unknown) microarchitectures, the need for online autotuning has become increasingly important. We present SiblingRivalry, a new model for always-on online autotuning that allows parallel programs to continuously adapt and optimize themselves to their environment. In our system, requests are processed by dividing the available cores in half and processing two identical copies of each request in parallel, one on each half. Half of the cores are devoted to a known safe program configuration, while the other half are used for an experimental program configuration chosen by our self-adapting evolutionary algorithm. When the faster configuration completes, its results are returned and the slower configuration is terminated. Over time, this constant experimentation allows programs to adapt to changing dynamic environments and often outperform the original algorithm that uses the entire system.
    Funding: United States Dept. of Energy (DOE Award DE-SC0005288).
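    A minimal sketch of the local-competition idea, using Python's multiprocessing as a stand-in for the real system (which splits physical cores and feeds outcomes back to the evolutionary tuner; the configurations and workload below are hypothetical):

```python
import multiprocessing as mp
import time

def run_config(config, request, queue):
    # Stand-in workload: both siblings compute the same answer for the
    # request, at a speed that depends on the program configuration.
    time.sleep(config["simulated_cost"])
    queue.put((config["name"], sum(range(request))))

def race(request, safe_cfg, experimental_cfg):
    queue = mp.Queue()
    procs = [mp.Process(target=run_config, args=(cfg, request, queue))
             for cfg in (safe_cfg, experimental_cfg)]
    for p in procs:
        p.start()
    winner = queue.get()      # the first configuration to finish wins
    for p in procs:
        p.terminate()         # the slower sibling is terminated
        p.join()
    return winner

if __name__ == "__main__":
    safe = {"name": "safe", "simulated_cost": 0.2}           # known-good config
    trial = {"name": "experimental", "simulated_cost": 0.1}  # tuner's candidate
    print(race(1_000_000, safe, trial))
```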

    A Framework for Automated Generation of Specialized Function Variants

    Efficient large-scale scientific computing requires efficient code, yet optimizing code for efficiency simultaneously renders it less readable, less maintainable, and less portable, and requires detailed knowledge of low-level computer architecture, which the developers of scientific applications may lack. The necessary knowledge is also subject to change over time as new architectures, such as GPGPU platforms like CUDA, which require very different optimizations than CPU-targeted code, become more prominent. The rise of scientific cloud computing means that developers may not even know what machine their code will be running on while they are developing it. This work takes steps towards automating the generation of code variants which are automatically optimized for both the execution environment and the input dataset. We demonstrate that augmenting an autotuning framework with a performance database that captures metadata about environment and input, and performing decision tree learning over that data, can help more fully automate the process of enhancing software performance.
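    A minimal sketch of the decision-tree step, assuming scikit-learn and a toy performance database (the metadata columns and variant labels are illustrative, not the paper's actual schema):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy performance database: metadata about the execution environment and
# input dataset, each record labeled with the best-performing variant seen.
# Columns (illustrative): [has_gpu, num_cores, input_size]
X = [
    [1, 16, 1_000_000],
    [1,  8,   500_000],
    [0, 16, 1_000_000],
    [0,  4,     1_000],
]
y = ["cuda", "cuda", "openmp", "serial"]  # best variant per record

model = DecisionTreeClassifier().fit(X, y)

# At run time, dispatch to the variant predicted for the current context.
print(model.predict([[0, 8, 200_000]])[0])
```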

    MLGOPerf: An ML Guided Inliner to Optimize Performance

    For the past 25 years, we have witnessed extensive application of machine learning to the compiler space, notably to the optimization-selection and phase-ordering problems. However, few works have been upstreamed into state-of-the-art compilers such as LLVM, i.e., seamlessly integrated into the optimization pipeline of a compiler so as to be readily deployed by the user. MLGO was among the first such projects, and it strives only to reduce the code size of a binary with an ML-based inliner using reinforcement learning. This paper presents MLGOPerf, the first end-to-end framework capable of optimizing performance using LLVM's ML-Inliner. It employs a secondary ML model to generate the rewards used for training a retargeted reinforcement-learning agent, previously used as the primary model by MLGO. It does so by predicting the post-inlining speedup of the function under analysis, which enables a fast training framework for the primary model that would otherwise be impractical. The experimental results show MLGOPerf is able to gain up to 1.8% and 2.2% over LLVM's optimization at O3 when trained for performance on the SPEC CPU2006 and Cbench benchmarks, respectively. Furthermore, the proposed approach provides up to 26% more opportunities to autotune code regions for our benchmarks, which can be translated into an additional 3.7% speedup.
    Comment: Version 2: added the missing Table 6. The short version of this work is accepted at ACM/IEEE CASES 2022.
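    A minimal sketch of the reward-model idea, with a generic regressor and random features standing in for the paper's actual speedup predictor and function features:

```python
import random
from sklearn.ensemble import GradientBoostingRegressor

# Secondary model: predicts the post-inlining speedup of a function from
# its features, so the RL inliner can be rewarded without compiling and
# running every candidate (synthetic training data for illustration).
train_X = [[random.random() for _ in range(4)] for _ in range(100)]
train_y = [random.uniform(0.9, 1.2) for _ in range(100)]
reward_model = GradientBoostingRegressor().fit(train_X, train_y)

def reward(function_features):
    # The predicted speedup replaces a costly compile-and-run measurement,
    # which is what makes training the primary (inlining) policy fast.
    return reward_model.predict([function_features])[0]

print(reward([0.1, 0.5, 0.3, 0.7]))
```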

    Lost in translation: Exposing hidden compiler optimization opportunities

    Existing iterative compilation and machine-learning-based optimization techniques have proven very successful in achieving better optimizations than the standard optimization levels of a compiler. However, they were not engineered to support the tuning of a compiler's optimizer as part of the compiler's daily development cycle. In this paper, we first establish the properties a technique must exhibit to enable such tuning. We then introduce an enhancement to the classic nightly routine testing of compilers which exhibits all the required properties and is thus capable of driving the improvement and tuning of the compiler's common optimizer. This is achieved by leveraging resource usage and compilation information collected while systematically exploiting prefixes of the transformations applied at standard optimization levels. Experimental evaluation using the LLVM v6.0.1 compiler demonstrated that the new approach was able to reveal hidden cross-architecture and architecture-dependent potential optimizations on two popular processors: the Intel i5-6300U and the Arm Cortex-A53-based Broadcom BCM2837 used in the Raspberry Pi 3B+. As a case study, we demonstrate how the insights from our approach enabled us to identify and remove a significant shortcoming of the CFG simplification pass of the LLVM v6.0.1 compiler.
    Comment: 31 pages, 7 figures, 2 tables. arXiv admin note: text overlap with arXiv:1802.0984.
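    A minimal sketch of the prefix exploration, assuming the pass sequence of a standard level has been recovered beforehand and that `opt` and `clang` are on the PATH (the pass list and file names are placeholders; the sketch uses the new pass manager's -passes= syntax, whereas LLVM v6.0.1 itself used the legacy manager):

```python
import subprocess
import time

# Placeholder: passes applied by a standard optimization level, in order.
O3_PASSES = ["sroa", "instcombine", "simplifycfg", "gvn"]

for k in range(1, len(O3_PASSES) + 1):
    prefix = ",".join(O3_PASSES[:k])
    # Apply only the first k transformations, then build and time the result;
    # comparing prefixes exposes where a pass hurts rather than helps.
    subprocess.run(["opt", f"-passes={prefix}", "input.ll", "-o", "prefix.bc"],
                   check=True)
    subprocess.run(["clang", "prefix.bc", "-o", "prefix.out"], check=True)
    t0 = time.perf_counter()
    subprocess.run(["./prefix.out"], check=True)
    print(f"prefix of length {k}: {time.perf_counter() - t0:.3f}s")
```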

    Master of Science

    Tensors are mathematical representations of physical entities that have magnitude in multiple directions. Tensor contraction is a way of creating new such objects from existing ones using the Einstein summation convention. It is commonly used in physics and chemistry for solving problems like spectral elements and coupled cluster computation. Mathematically, tensor contraction operations can be reduced to expressions similar to matrix multiplications. However, linear algebra libraries (e.g., BLAS and LAPACK) perform poorly on the small matrix sizes that commonly arise in certain tensor contraction computations. Another challenge in the computation of tensor contraction is the difference between the mathematical representation and an efficient implementation. This thesis proposes a framework that allows users to express a tensor contraction problem in a high-level mathematical representation and transform it into a linear algebra expression that is mapped to a high-performance implementation. The framework produces code that takes advantage of the parallelism that graphics processing units (GPUs) provide. It relies on autotuning to find the preferred implementation that achieves high performance on the available device. Performance results from the benchmarks tested, nekbone and NWChem, show that the output of the framework achieves speedups of 8.56x and 14.25x, respectively, on an NVIDIA Tesla C2050 GPU against the sequential version, and speedups of 8.87x and 17.62x on an NVIDIA Tesla K20c GPU. The parallel decompositions found by the tool were also tested against an OpenACC implementation, achieving speedups of 8.87x and 10.42x for nekbone and 7.25x and 10.34x for NWChem compared to the choices made by default by the OpenACC compiler. The contributions of this work are: (1) a simplified interface that allows the user to express tensor contraction using a high-level representation and transform it into high-performance code; (2) a decision algorithm that explores a set of optimization strategies for achieving performance; and (3) a demonstration that this approach can achieve better performance than OpenACC and can be used to accelerate OpenACC.
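    The reduction the thesis builds on can be seen with NumPy's einsum: a contraction written in Einstein notation maps directly onto a matrix multiplication (shapes chosen small for illustration):

```python
import numpy as np

# C[i,j] = sum_k A[i,k] * B[k,j]: the repeated index k is summed,
# which is exactly a matrix multiply.
A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
assert np.allclose(np.einsum("ik,kj->ij", A, B), A @ B)

# Higher-order case: T[a,b] = sum_{c,d} X[a,c,d] * Y[c,d,b] reshapes
# into a (4 x 12) @ (12 x 5) product; this flattening is how tensor
# contractions are mapped onto linear algebra kernels.
X = np.random.rand(4, 3, 4)
Y = np.random.rand(3, 4, 5)
T = np.einsum("acd,cdb->ab", X, Y)
assert np.allclose(T, X.reshape(4, 12) @ Y.reshape(12, 5))
```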