396 research outputs found

    A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details

    Get PDF
    Today’s compilers have a plethora of optimizations-transformations to choose from, and the correct choice, order as well parameters of transformations have a significant/large impact on performance; choosing the correct order and parameters of optimizations has been a long standing problem in compilation research, which until now remains unsolved; the separate sub-problems optimization gives a different schedule/binary for each sub-problem and these schedules cannot coexist, as by refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques but the search space is so big that it cannot be searched even by using modern supercomputers. Moreover, compiler transformations do not take into account the hardware architecture details and data reuse in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude and thus an efficient solution is now capable to be found. The transformations are the following: loop tiling (including the number of the levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated over iterative compilation and gcc/icc compilers, on both embedded and general purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time

    A Survey on Compiler Autotuning using Machine Learning

    Full text link
    Since the mid-1990s, researchers have been trying to use machine-learning based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering of optimizations. The survey highlights the approaches taken so far, the obtained results, the fine-grain classification among different approaches and finally, the influential papers of the field.Comment: version 5.0 (updated on September 2018)- Preprint Version For our Accepted Journal @ ACM CSUR 2018 (42 pages) - This survey will be updated quarterly here (Send me your new published papers to be added in the subsequent version) History: Received November 2016; Revised August 2017; Revised February 2018; Accepted March 2018

    Compiler-centric across-stack deep learning acceleration

    Get PDF
    Optimizing the deployment of Deep Neural Networks (DNNs) is hard. Despite deep learning approaches increasingly providing state-of-the-art solutions to a variety of difficult problems, such as computer vision and natural language processing, DNNs can be prohibitively expensive, for example, in terms of inference time or memory usage. Effective exploration of the design space requires a holistic approach, including a range of topics from machine learning, systems, and hardware. The rapid proliferation of deep learning applications has raised demand for efficient exploration and acceleration of deep learning based solutions. However, managing the range of optimization techniques, as well as how they interact with each other across the stack is a non-trivial task. A family of emerging specialized compilers for deep learning, tensor compilers, appear to be a strong candidate to help manage the complexity of across-stack optimization choices, and enable new approaches. This thesis presents new techniques and explorations of the Deep Learning Acceleration Stack (DLAS), with the perspective that the tensor compiler will increasingly be the center of this stack. First, we motivate the challenges in exploring DLAS, by describing the experience of running a perturbation study varying parameters at every layer of the stack. The core of the study is implemented using a tensor compiler, which reduces the complexity of evaluating the wide range of variants, although still requires a significant engineering effort to realize. Next, we develop a new algorithm for grouped convolution, a model optimization technique for which existing solutions provided poor inference time scaling. We implement and optimize our algorithm using a tensor compiler, outperforming existing approaches by 5.1× on average (arithmetic mean). Finally, we propose a technique, transfer-tuning, to reduce the search time required for automatic tensor compiler code optimization, reducing the search time required by 6.5× on average. The techniques and contributions of this thesis across these interconnected domains demonstrate the exciting potential of tensor compilers to simplify and improve design space exploration for DNNs, and their deployment. The outcomes of this thesis enable new lines of research to enable machine learning developers to keep up with the rapidly evolving landscape of neural architectures and hardware

    High Performance with Prescriptive Optimization and Debugging

    Get PDF

    Compilers that learn to optimise: a probabilistic machine learning approach

    Get PDF
    Compiler optimisation is the process of making a compiler produce better code, i.e. code that, for example, runs faster on a target architecture. Although numerous program transformations for optimisation have been proposed in the literature, these transformations are not always beneficial and they can interact in very complex ways. Traditional approaches adopted by compiler writers fix the order of the transformations and decide when and how these transformations should be applied to a program by using hard-coded heuristics. However, these heuristics require a lot of time and effort to construct and may sacrifice performance on programs they have not been tuned for.This thesis proposes a probabilistic machine learning solution to the compiler optimisation problem that automatically determines "good" optimisation strategies for programs. This approach uses predictive modelling in order to search the space of compiler transformations. Unlike most previous work that learns when/how to apply a single transformation in isolation or a fixed-order set of transformations, the techniques proposed in this thesis are capable of tackling the general problem of predicting "good" sequences of compiler transformations. This is achieved by exploiting transference across programs with two different techniques: Predictive Search Distributions (PSD) and multi-task Gaussian process prediction (multi-task GP). While the former directly addresses the problem of predicting "good" transformation sequences, the latter learns regression models (or proxies) of the performance of the programs in order to rapidly scan the space of transformation sequences.Both methods, PSD and multi-task GP, are formulated as general machine learning techniques. In particular, the PSD method is proposed in order to speed up search in combinatorial optimisation problems by learning a distribution over good solutions on a set of problem in¬ stances and using that distribution to search the optimisation space of a problem that has not been seen before. Likewise, multi-task GP is proposed as a general method for multi-task learning that directly models the correlation between several machine learning tasks, exploiting the shared information across the tasks.Additionally, this thesis presents an extension to the well-known analysis of variance (ANOVA) methodology in order to deal with sequence data. This extension is used to address the problem of optimisation space characterisation by identifying and quantifying the main effects of program transformations and their interactions.Finally, the machine learning methods proposed are successfully applied to a data set that has been generated as a result of the application of source-to-source transformations to 12 C programs from the UTDSP benchmark suite

    Enhanced applicability of loop transformations

    Get PDF

    Analytical cost metrics: days of future past

    Get PDF
    2019 Summer.Includes bibliographical references.Future exascale high-performance computing (HPC) systems are expected to be increasingly heterogeneous, consisting of several multi-core CPUs and a large number of accelerators, special-purpose hardware that will increase the computing power of the system in a very energy-efficient way. Specialized, energy-efficient accelerators are also an important component in many diverse systems beyond HPC: gaming machines, general purpose workstations, tablets, phones and other media devices. With Moore's law driving the evolution of hardware platforms towards exascale, the dominant performance metric (time efficiency) has now expanded to also incorporate power/energy efficiency. This work builds analytical cost models for cost metrics such as time, energy, memory access, and silicon area. These models are used to predict the performance of applications, for performance tuning, and chip design. The idea is to work with domain specific accelerators where analytical cost models can be accurately used for performance optimization. The performance optimization problems are formulated as mathematical optimization problems. This work explores the analytical cost modeling and mathematical optimization approach in a few ways. For stencil applications and GPU architectures, the analytical cost models are developed for execution time as well as energy. The models are used for performance tuning over existing architectures, and are coupled with silicon area models of GPU architectures to generate highly efficient architecture configurations. For matrix chain products, analytical closed form solutions for off-chip data movement are built and used to minimize the total data movement cost of a minimum op count tree

    Empirically Tuning HPC Kernels with iFKO

    Get PDF
    iFKO (iterative Floating point Kernel Optimizer) is an open-source iterative empirical compilation framework which can be used to tune high performance computing (HPC) kernels. The goal of our research is to advance iterative empirical compilation to the degree that the performance it can achieve is comparable to that delivered by painstaking hand tuning in assembly. This will allow many HPC researchers to spend precious development time on higher level aspects of tuning such as parallelization, as well as enabling computational scientists to develop new algorithms that demand new high performance kernels. At present, algorithms that cannot use hand-tuned performance libraries tend to lose to even inferior algorithms that can. We discuss our new autovectorization technique (speculative vectorization) which can autovectorize loops past dependent branches by speculating along frequently taken paths, even when other paths cannot be effectively vectorized. We implemented this technique in iFKO and demonstrated significant speedup for kernels that prior vectorization techniques could not optimize. We have developed an optimization for two dimensional array indexing that is critical for allowing us to heavily unroll and jam loops without restriction from integer register pressure. We then extended the state of the art single basic block vectorization method, SLP, to vectorize nested loops. We have also introduced optimized reductions that can retain full SIMD parallelization for the entire reduction, as well as doing loop specialization and unswitching as needed to address vector alignment issues and paths inside the loops which inhibit autovectorization. We have also implemented a critical transformation for optimal vectorization of mixed-type data. Combining all these techniques we can now fully vectorize the loopnests for our most complicated kernels, allowing us to achieve performance very close to that of hand-tuned assembly
    • …
    corecore