Search CORE

547 research outputs found

Machine Learning in Compiler Optimization

Author: O'Boyle Michael
Wang Zheng
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 09/05/2018
Field of study

In the last decade, machine learning based compilation has moved from an an obscure research niche to a mainstream activity. In this article, we describe the relationship between machine learning and compiler optimisation and introduce the main concepts of features, models, training and deployment. We then provide a comprehensive survey and provide a road map for the wide variety of different research areas. We conclude with a discussion on open issues in the area and potential research directions. This paper provides both an accessible introduction to the fast moving area of machine learning based compilation and a detailed bibliography of its main achievements

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Lancaster E-Prints

Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2015
Field of study

Crossref

RICIS Symposium 1988

Author
Publication venue
Publication date
Field of study

Integrated Environments for Large, Complex Systems is the theme for the RICIS symposium of 1988. Distinguished professionals from industry, government, and academia have been invited to participate and present their views and experiences regarding research, education, and future directions related to this topic. Within RICIS, more than half of the research being conducted is in the area of Computer Systems and Software Engineering. The focus of this research is on the software development life-cycle for large, complex, distributed systems. Within the education and training component of RICIS, the primary emphasis has been to provide education and training for software professionals

NASA Technical Reports Server

PyCUDA and PyOpenCL: A Scripting-Based Approach to GPU Run-Time Code Generation

Author: Ahmed Fasih
Andreas Klöckner
Bell
Bryan Catanzaro
Buck
Chandler
Dalcín
Eich
Feldman
Flanagan
Frigo
Group
Hestenes
Hesthaven
Kennedy
Klöckner
Lam
Langtangen
Lindholm
McCarthy
McCool
Nicolas Pinto
Oliphant
Owens
Paul Ivanov
Pinto
Pinto
Prud’homme
Reynders
Seiler
Stein
Valiant
van Hateren
Veldhuizen
Wang
Whaley
Yunsup Lee
Publication venue: 'Elsevier BV'
Publication date: 29/03/2011
Field of study

High-performance computing has recently seen a surge of interest in heterogeneous systems, with an emphasis on modern Graphics Processing Units (GPUs). These devices offer tremendous potential for performance and efficiency in important large-scale applications of computational science. However, exploiting this potential can be challenging, as one must adapt to the specialized and rapidly evolving computing environment currently exhibited by GPUs. One way of addressing this challenge is to embrace better techniques and develop tools tailored to their needs. This article presents one simple technique, GPU run-time code generation (RTCG), along with PyCUDA and PyOpenCL, two open-source toolkits that support this technique. In introducing PyCUDA and PyOpenCL, this article proposes the combination of a dynamic, high-level scripting language with the massive performance of a GPU as a compelling two-tiered computing platform, potentially offering significant performance and productivity advantages over conventional single-tier, static systems. The concept of RTCG is simple and easily implemented using existing, robust infrastructure. Nonetheless it is powerful enough to support (and encourage) the creation of custom application-specific tools by its users. The premise of the paper is illustrated by a wide range of examples where the technique has been applied with considerable success.Comment: Submitted to Parallel Computing, Elsevie

arXiv.org e-Print Archive

Crossref

Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

Author: Zhang Zheng
Publication venue: W&M ScholarWorks
Publication date: 01/01/2012
Field of study

Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today\u27s general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.;Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program level transformations based upon two fundamental techniques --- thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.;In memory performance, this dissertation concentrates on two aspects: the influence of nonuniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, which refer to the memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to that of the control flow problem, while in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations

College of William & Mary: W&M Publish

Dynamic computation migration in distributed shared memory systems

Author: Hsieh Wilson Cheng-Yi
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/1995
Field of study

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1995.Vita.Includes bibliographical references (p. 123-131).by Wilson Cheng-Yi Hsieh.Ph.D

CiteSeerX

DSpace@MIT

A Novel Thread Scheduler Design for Polymorphic Embedded Systems

Author: Krishnamurthy Viswanath
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2010
Field of study

A novel thread scheduler design for polymorphic embedded systems Abstract: The ever-increasing complexity of current day embedded systems necessitates that these systems be adaptable and scalable to user demands. With the growing use of consumer electronic devices, embedded computing is steadily approaching the desktop computing trend. End users expect their consumer electronic devices to operate faster than before and offer support for a wide range of applications. In order to accommodate a broad range of user applications, the challenge is to come up with an efficient design for the embedded system scheduler. Hence the primary goal of the thesis is to design a thread scheduler for a polymorphic thread computing embedded system. This is the first ever novel attempt at designing a polymorphic thread scheduler as none of the existing or conventional schedulers have accounted for thread polymorphism. To summarize the thesis work, a dynamic thread scheduler for a Multiple Application, Multithreaded polymorphic system has been implemented with User satisfaction as its objective function. The sigmoid function helps to accurately model end user perception in an embedded system as opposed to the conventional systems where the objective is to maximize/minimize the performance metric such as performance, power, energy etc. The Polymorphic thread scheduler framework which operates in a dynamic environment with N multithreaded applications has been explained and evaluated. Randomly generated Application graphs are used to test the Polymorphic scheduler framework. The benefits obtained by using User Satisfaction as the objective function and the performance enhancements obtained using the novel thread scheduler are demonstrated clearly using the result graphs. The advantages of the proposed greedy thread scheduling algorithm are demonstrated by comparison against conventional thread scheduling approaches like First Come First Serve (FCFS) and priority scheduling schemes

Digital Repository @ Iowa State University (ISU)

Collective Mind: Towards Practical and Collaborative Auto-Tuning

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2014
Field of study

Crossref