
    A comparison of online and offline strategies for program adaptation


    Automatic creation of tile size selection models using neural networks

    Tiling is a widely used loop transformation for exposing and exploiting parallelism and data locality. Effective use of tiling requires selection and tuning of the tile sizes. This is usually achieved by hand-crafting tile size selection (TSS) models that characterize the performance of the tiled program as a function of tile sizes. The best tile sizes are selected either by directly using the TSS model or by using the TSS model together with an empirical search. Hand-crafting accurate TSS models is hard, and adapting them to a different architecture or compiler, or even keeping them up to date with the evolution of a single compiler, is often just as hard. Instead of hand-crafting TSS models, can we automatically learn or create them? In this paper, we show that for a specific class of programs fairly accurate TSS models can be automatically created by using a combination of simple program features, synthetic kernels, and standard machine learning techniques. The automatic TSS model generation scheme can also be used directly for adapting the model and/or keeping it up to date. We evaluate our scheme on six different architecture-compiler combinations (chosen from three architectures and four compilers). The models learned by our method have consistently shown near-optimal performance (within 5% of the optimal on average) across the tested architecture-compiler combinations.
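
    As a rough illustration of the approach described above, the sketch below trains a small neural-network regressor on timings of tiled kernels and uses it to rank candidate tile sizes. The specific feature set, the use of scikit-learn's MLPRegressor, and the candidate grid are assumptions made here for illustration, not the paper's exact setup.

        import itertools
        import numpy as np
        from sklearn.neural_network import MLPRegressor

        def features(tile, problem_size):
            # Simple program/tile features: footprints, aspect ratios, trip counts.
            ti, tj, tk = tile
            n = problem_size
            return [ti * tj, tj * tk, ti * tk, ti / tj, n / ti, n / tj, n / tk]

        def train_tss_model(measurements):
            # measurements: list of ((ti, tj, tk), problem_size, runtime_in_seconds)
            # gathered from synthetic kernels run on the target machine.
            X = np.array([features(t, n) for t, n, _ in measurements])
            y = np.array([r for _, _, r in measurements])
            model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000)
            model.fit(X, y)
            return model

        def best_tile(model, problem_size, candidates):
            # Rank candidate tile sizes with the learned model; lowest predicted time wins.
            preds = model.predict(np.array([features(t, problem_size) for t in candidates]))
            return candidates[int(np.argmin(preds))]

        # Example candidate grid: power-of-two tile sizes for a three-deep loop nest.
        candidates = list(itertools.product([16, 32, 64, 128], repeat=3))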

    Psort: automated code tuning

    This thesis describes the design and implementation of an automated code tuner for psort, a fast sorting library for large datasets. Our work, motivated by the need to guarantee high performance while keeping the cost to the end user low, provides a reusable and portable framework that can easily be extended to automatically tune virtually every portion of the source code, including code that has not yet been written. Experiments show that our system produces code that is significantly faster than the original code, suggesting that psort should include it among its tools.
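
    The core of such a tuner is a compile-measure-select loop. The sketch below shows a minimal version of that loop; the gcc invocation and the variant/flag handling are hypothetical stand-ins for psort's actual build and timing infrastructure.

        import subprocess
        import time

        def compile_variant(source, flags, exe):
            # Build one candidate variant of the routine being tuned (plain gcc assumed).
            subprocess.run(["gcc", "-O3", *flags, source, "-o", exe], check=True)

        def run_benchmark(exe, repetitions=5):
            # Time the variant several times and keep the best wall-clock run.
            best = float("inf")
            for _ in range(repetitions):
                start = time.perf_counter()
                subprocess.run([exe], check=True)
                best = min(best, time.perf_counter() - start)
            return best

        def autotune(source, candidate_flag_sets):
            # Exhaustively benchmark the candidates and return the fastest configuration.
            timings = {}
            for i, flags in enumerate(candidate_flag_sets):
                exe = f"./variant_{i}"
                compile_variant(source, flags, exe)
                timings[tuple(flags)] = run_benchmark(exe)
            return min(timings, key=timings.get)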

    Combining Prior Knowledge and Data: Beyond the Bayesian Framework

    For many tasks, such as text categorization and control of robotic systems, state-of-the-art learning systems can produce results comparable in accuracy to those of human subjects. However, the amount of training data needed for such systems can be prohibitively large for many practical problems. A text categorization system, for example, may need to see many text postings manually tagged with their subjects before it learns to predict the subject of the next posting with high accuracy. A reinforcement learning (RL) system learning how to drive a car needs a lot of experimentation with the actual car before acquiring the optimal policy. An optimizing compiler targeting a certain platform has to construct, compile, and execute many versions of the same code with different optimization parameters to determine which optimizations work best. Such extensive sampling can be time-consuming, expensive (both in the human expertise needed to label data and in wear and tear on the robotic equipment used for exploration in the case of RL), and sometimes dangerous (e.g., an RL agent driving the car off a cliff to see if it survives the crash).

    The goal of this work is to reduce the amount of training data an agent needs in order to learn how to perform a task successfully. This is done by providing the system with prior knowledge about its domain. The knowledge is used to bias the agent towards useful solutions and limit the amount of training needed. We explore this task in three contexts: classification (determining the subject of a newsgroup posting), control (learning to perform tasks such as driving a car up a mountain in simulation), and optimization (optimizing the performance of linear algebra operations on different hardware platforms).

    For the text categorization problem, we introduce a novel algorithm which efficiently integrates prior knowledge into large-margin classification. We show that prior knowledge simplifies the problem by reducing the size of the hypothesis space. We also provide formal convergence guarantees for our algorithm. For reinforcement learning, we introduce a novel framework for defining planning problems in terms of qualitative statements about the world (e.g., "the faster the car is going, the more likely it is to reach the top of the mountain"). We present an algorithm based on policy iteration for solving such qualitative problems and prove its convergence. We also present an alternative framework which allows the user to specify prior knowledge quantitatively in the form of a Markov Decision Process (MDP). This prior is used to focus exploration on those regions of the world in which the optimal policy is most sensitive to perturbations in transition probabilities and rewards. Finally, in the compiler optimization problem, the prior is based on an analytic model which determines good optimization parameters for a given platform. This model defines a Bayesian prior which, combined with empirical samples (obtained by measuring the performance of optimized code segments), determines the maximum-a-posteriori estimate of the optimization parameters.
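
    For the compiler-optimization part of this abstract, the sketch below shows how an analytic-model prior and noisy empirical measurements can be combined into a maximum-a-posteriori estimate. Modeling the optimization parameter with a Gaussian prior and Gaussian measurement noise, and the concrete numbers used, are assumptions made here for illustration.

        import numpy as np

        def map_estimate(prior_mean, prior_var, samples, noise_var):
            # For a Gaussian prior and Gaussian likelihood, the MAP estimate is the
            # precision-weighted average of the prior mean and the sample evidence.
            samples = np.asarray(samples, dtype=float)
            n = len(samples)
            post_var = 1.0 / (1.0 / prior_var + n / noise_var)
            post_mean = post_var * (prior_mean / prior_var + samples.sum() / noise_var)
            return post_mean, post_var

        # Hypothetical example: the analytic model predicts a best tile size near 64,
        # while five empirical searches on the target platform found optima near 80.
        mean, var = map_estimate(prior_mean=64.0, prior_var=20.0 ** 2,
                                 samples=[78, 84, 80, 76, 82], noise_var=10.0 ** 2)
        print(f"MAP tile-size estimate: {mean:.1f} (posterior std {var ** 0.5:.1f})")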

    Predictive Modeling in a Polyhedral Optimization Space

    High-level program optimizations, such as loop transformations, are critical for high performance on multi-core targets. However, complex sequences of loop transformations are often required to expose parallelism (both coarse-grain and fine-grain) and improve data locality. The polyhedral compilation framework has proved to be very effective at representing these complex sequences and restructuring compute-intensive applications, seamlessly handling perfectly and imperfectly nested loops. Nevertheless, identifying the most effective loop transformations remains a major challenge. We address the problem of selecting the best polyhedral optimizations with dedicated machine learning models, trained specifically on the target machine. We show that these models can quickly select high-performance optimizations with very limited iterative search. Our end-to-end framework is validated using numerous benchmarks on two modern multi-core platforms. We investigate a variety of machine learning algorithms and hardware counters, and we obtain performance improvements over production compilers ranging on average from 3.2x to 8.7x, by running no more than 6 program variants from a polyhedral optimization space.
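
    The sketch below illustrates the "predict, then search briefly" strategy the abstract describes: a model trained on measured speedups ranks candidate polyhedral transformation sequences, and only the top handful is actually compiled and run. The feature encoding and the choice of a random-forest regressor are assumptions made here for illustration.

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        def train_speedup_model(training_rows):
            # training_rows: list of (feature_vector, measured_speedup), where the
            # feature vector encodes program characteristics, hardware counters, and
            # the candidate transformation sequence.
            X = np.array([f for f, _ in training_rows])
            y = np.array([s for _, s in training_rows])
            model = RandomForestRegressor(n_estimators=200)
            model.fit(X, y)
            return model

        def select_variants(model, candidate_features, k=6):
            # Return the k candidates the model predicts to be fastest; only these
            # few variants are then compiled and timed on the target machine.
            preds = model.predict(np.array(candidate_features))
            return list(np.argsort(-preds)[:k])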

    Think Globally, Search Locally

    A key step in program optimization is the determination of optimal values for code optimization parameters such as cache tile sizes and loop unrolling factors. One approach, implemented in most compilers, is to use analytical models to determine these values. The other approach, used in library generators like ATLAS, is to perform a global search over the space of parameter values by generating different versions of the code and executing them on the actual machine to find the parameter values that give the best performance. Neither approach is suitable for use in general-purpose compilers that must generate high-quality code for large programs running on complex architectures. Model-driven optimization may incur a performance penalty of 10-20% even for a relatively simple code like matrix multiplication, as was shown recently by Yotov et al. On the other hand, global search is not tractable for optimizing large programs for complex architectures because the optimization space is too large. To address this problem, some researchers are exploring more sophisticated search algorithms such as the simplex method, but it remains to be seen whether these methods reduce search time without compromising the quality of the solution.

    In this paper, we advocate a different methodology for generating high-performance code without increasing search time dramatically. Our methodology has three components: (i) modeling, (ii) local search, and (iii) model refinement. We use analytical models to estimate optimal values for transformation parameters. Since it is impossible to build tractable analytical models that capture all the features of complex architectures, we advocate improving these estimates with a local search in the neighborhood of the model-predicted values. Finally, if the performance gap between handwritten code and generated code is substantial on some architecture, we advocate model refinement.

    To demonstrate this methodology, we built a modified ATLAS system that used a simple analytical model and local search, and showed that on most architectures, the performance of the code produced by this system was comparable to that of code produced by the original ATLAS system using global search. However, on x86 architectures, the gap in performance was substantial and could not be bridged by local search alone. We argue that the problem is that the model assumed aggressive operation scheduling to mask instruction latencies, but such scheduling can actually be harmful on x86 architectures, a somewhat surprising fact that does not appear to be widely known. To address this problem, we use model refinement to generate a more sophisticated model that, when combined with local search, enables the production of high-quality code on both RISC and CISC architectures.
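
    The sketch below illustrates the model-plus-local-search part of this methodology: an analytical model supplies a starting point, and a small hill-climbing search over neighboring tile sizes closes the gap left by model inaccuracy. The starting values and the measure() callback are placeholders, not ATLAS's actual interface.

        def local_search(start, measure, step=8, max_rounds=10):
            # Hill-climb each tile dimension around the model-predicted start point,
            # keeping any move that the timing callback reports as an improvement.
            best, best_time = list(start), measure(start)
            for _ in range(max_rounds):
                improved = False
                for dim in range(len(best)):
                    for delta in (-step, step):
                        cand = list(best)
                        cand[dim] = max(1, cand[dim] + delta)
                        t = measure(cand)
                        if t < best_time:
                            best, best_time, improved = cand, t, True
                if not improved:
                    break
            return best, best_time

        # Usage: start from the analytical model's prediction, then refine empirically.
        model_predicted_tiles = [64, 64, 64]  # placeholder model output
        # tiles, runtime = local_search(model_predicted_tiles, measure=run_tiled_kernel)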