311 research outputs found

    Run-time optimization of adaptive irregular applications

    Compared to traditional compile-time optimization, run-time optimization can offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimization during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of the input data, the program's dynamic behavior, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on the program's input data and its dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of a program's dynamic memory access patterns with the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework that automatically and adaptively selects at run time the best-performing, functionally equivalent algorithm for each of its execution instances. The selection is based on prediction models generated automatically off-line and on characteristics of the algorithm's input data collected and analyzed dynamically. In this dissertation, we specialize this framework for the automatic selection of reduction algorithms. We identify a small set of machine-independent, high-level characterization parameters and deploy an off-line, systematic experimental process to generate prediction models; these models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors. Specifically, for reduction algorithm selection, the performance of the selected algorithm is within 2% of optimal and on average 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
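
    For concreteness, the "Replicated Buffer" baseline mentioned above privatizes the reduction array: each thread accumulates into its own copy, and the copies are merged after the loop. A minimal sketch in C++ with OpenMP follows; the histogram-style loop and all names are illustrative, not taken from the dissertation.

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

// Irregular reduction: hist[idx[i]] += val[i], where idx is only known at run time.
// Replicated-buffer strategy: one private copy of the reduction array per thread,
// merged after the parallel loop.
void replicated_buffer_reduction(const std::vector<int>& idx,
                                 const std::vector<double>& val,
                                 std::vector<double>& hist) {
    const int nthreads = omp_get_max_threads();
    std::vector<std::vector<double>> priv(nthreads,
                                          std::vector<double>(hist.size(), 0.0));

    #pragma omp parallel
    {
        std::vector<double>& mine = priv[omp_get_thread_num()];
        #pragma omp for
        for (long i = 0; i < (long)idx.size(); ++i)
            mine[idx[i]] += val[i];           // no synchronization needed
    }

    // Merge the replicated buffers (this step is what makes the scheme
    // expensive when the reduction array is large but sparsely touched).
    for (int t = 0; t < nthreads; ++t)
        for (std::size_t j = 0; j < hist.size(); ++j)
            hist[j] += priv[t][j];
}
```

    The framework described in the abstract would choose at run time between this scheme and alternatives, based on dynamically collected characteristics of the input data (for example, how densely the index array covers the reduction array).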

    An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)

    We present a detailed approach to using two new computer hardware architectures -- CBEA and CUDA -- to accelerate a scientific data-analysis application (Einstein@Home). Our results suggest that both architectures suit the application quite well and that the performance achievable within the same software development time-frame is nearly identical. Comment: Accepted for publication at the International Conference on Parallel Processing and Applied Mathematics (PPAM 2009).

    Fast Matlab compatible sparse assembly on multicore computers

    In this paper we develop and implement a fast sparse assembly algorithm, the fundamental operation that creates a compressed matrix from raw index data. Since this is often a demanding and sometimes critical operation, a highly efficient implementation is of interest. We show how to achieve this, and moreover, we show how our implementation can be parallelized to exploit the power of modern multicore computers. Our freely available, fully Matlab-compatible code achieves a speedup of about 5x over the built-in serial implementation on a typical 6-core machine and about 10x on a dual-socket 16-core machine.
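
    For context, the assembly operation in question is what Matlab's sparse(i, j, v) performs: turning raw (row, column, value) triplets, where repeated index pairs must have their values summed, into a compressed matrix. The serial sketch below, in C++ producing CSR storage, is only meant to illustrate the operation, not the paper's algorithm or its parallelization; columns within a row are left in first-seen order for brevity.

```cpp
#include <cstddef>
#include <vector>

struct Csr {                           // compressed sparse row storage
    std::vector<std::size_t> rowptr;   // length nrows + 1
    std::vector<std::size_t> col;
    std::vector<double> val;
};

// Assemble a CSR matrix from triplets (ri[k], ci[k], v[k]); duplicates with
// the same (row, col) are summed, as Matlab's sparse(i, j, v) does.
Csr assemble(std::size_t nrows, std::size_t ncols,
             const std::vector<std::size_t>& ri,
             const std::vector<std::size_t>& ci,
             const std::vector<double>& v) {
    // 1. Count triplets per row and prefix-sum to get per-row bucket offsets.
    std::vector<std::size_t> offset(nrows + 1, 0);
    for (std::size_t r : ri) ++offset[r + 1];
    for (std::size_t r = 0; r < nrows; ++r) offset[r + 1] += offset[r];

    // 2. Bucket the triplets by row (stable counting sort).
    std::vector<std::size_t> bcol(ri.size());
    std::vector<double> bval(ri.size());
    std::vector<std::size_t> cursor(offset.begin(), offset.end() - 1);
    for (std::size_t k = 0; k < ri.size(); ++k) {
        std::size_t p = cursor[ri[k]]++;
        bcol[p] = ci[k];
        bval[p] = v[k];
    }

    // 3. Per row, sum duplicate columns with a dense accumulator and emit
    //    the compressed row.
    Csr a;
    a.rowptr.assign(nrows + 1, 0);
    std::vector<double> acc(ncols, 0.0);
    std::vector<char> seen(ncols, 0);
    for (std::size_t r = 0; r < nrows; ++r) {
        std::size_t first = a.col.size();
        for (std::size_t p = offset[r]; p < offset[r + 1]; ++p) {
            if (!seen[bcol[p]]) { seen[bcol[p]] = 1; a.col.push_back(bcol[p]); }
            acc[bcol[p]] += bval[p];
        }
        for (std::size_t q = first; q < a.col.size(); ++q) {
            a.val.push_back(acc[a.col[q]]);
            acc[a.col[q]] = 0.0;           // reset accumulator for the next row
            seen[a.col[q]] = 0;
        }
        a.rowptr[r + 1] = a.col.size();
    }
    return a;
}
```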

    Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

    Maximizing the level of parallelism in applications requires minimizing the overheads due to load imbalance and the waiting time due to memory latency. Compiler optimization is one of the most effective solutions to this problem. The compiler can detect the data dependencies in an application and analyze specific sections of code for parallelization potential. However, these compiler techniques are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. One way to address this challenge is to use runtime methods, deferring a certain amount of code analysis to runtime. In this research, we improve the performance of the parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results of the research were evaluated using an Airfoil application, which showed a 40-50% improvement in parallel performance. Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017).
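
    Two of the runtime techniques named here, asynchronous tasking and dynamic chunk sizing, can be illustrated independently of OP2 and HPX. The sketch below uses standard C++ futures only as a stand-in for HPX's task machinery: iterations are handed out in chunks sized from the remaining work, so tasks that finish early pull the next chunk instead of idling at a static partition.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <future>
#include <thread>
#include <vector>

// Run body(i) for i in [0, n) using asynchronous tasks that pull work in
// dynamically sized chunks. Illustrative only: HPX provides its own task and
// executor machinery; std::async stands in for it here.
void parallel_for_dynamic(std::size_t n,
                          const std::function<void(std::size_t)>& body) {
    std::atomic<std::size_t> next{0};
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());

    auto worker = [&] {
        for (;;) {
            std::size_t seen = next.load();
            std::size_t remaining = (seen < n) ? n - seen : 0;
            if (remaining == 0) return;
            // Dynamic chunk sizing: take a fraction of the remaining work,
            // so chunks shrink as the loop drains and load stays balanced.
            std::size_t chunk = std::max<std::size_t>(1, remaining / (4 * workers));
            std::size_t begin = next.fetch_add(chunk);
            std::size_t end = std::min(n, begin + chunk);
            for (std::size_t i = begin; i < end; ++i) body(i);
        }
    };

    // Asynchronous tasking: launch one task per hardware thread and join them.
    std::vector<std::future<void>> tasks;
    for (unsigned w = 0; w < workers; ++w)
        tasks.push_back(std::async(std::launch::async, worker));
    for (auto& t : tasks) t.get();
}
```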

    APOLLO: Automatic speculative POLyhedral Loop Optimizer

    A few weeks ago, we were glad to announce the first release of Apollo, the Automatic speculative POLyhedral Loop Optimizer. Apollo applies polyhedral optimizations on-the-fly to loop nests whose control flow and memory access patterns cannot be determined at compile time. In contrast to existing tools, Apollo can handle any kind of loop nest whose memory accesses may be performed through pointers and indirections. At runtime, Apollo builds a predictive polyhedral model, which is used for speculative optimization, including parallelization. Being a dynamic system, Apollo can even apply the polyhedral model to nonlinear loops. This paper describes Apollo from the perspective of a user, as well as some of its main contributions and mechanisms, including just-in-time polyhedral compilation, which significantly extends the scope of polyhedral techniques.
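
    The predictive model can be pictured as follows: a sample of iterations is instrumented, linear functions of the loop counters are fitted to the observed addresses and bounds, and the optimized code then runs under the speculation that those functions keep holding, with verification and rollback if they do not. The following is a minimal sketch of that predict-and-verify idea, not Apollo's actual implementation.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Linear access model addr(i) = base + i * stride, fitted from profiled iterations.
struct LinearModel {
    std::intptr_t base;
    std::intptr_t stride;
    std::intptr_t predict(std::size_t i) const {
        return base + static_cast<std::intptr_t>(i) * stride;
    }
};

// Fit a linear model from addresses observed during an instrumented sample of
// iterations; fail if the observations are not linear in the loop counter.
std::optional<LinearModel> fit(const std::vector<std::intptr_t>& observed) {
    if (observed.size() < 2) return std::nullopt;
    LinearModel m{observed[0], observed[1] - observed[0]};
    for (std::size_t i = 2; i < observed.size(); ++i)
        if (observed[i] != m.predict(i)) return std::nullopt;
    return m;
}

// During the speculatively optimized run, each actual access is checked
// against the prediction; a mismatch triggers rollback to safe sequential code.
bool verify(const LinearModel& m, std::size_t i, std::intptr_t actual) {
    return actual == m.predict(i);
}
```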

    Does dynamic and speculative parallelization enable advanced parallelizing and optimizing code transformations?

    Thread-Level Speculation (TLS) is a dynamic and automatic parallelization strategy that can handle codes that cannot be parallelized at compile time because the information that can be extracted from the source code is insufficient. However, the proposed TLS systems are strongly limited in the kinds of parallelization they can apply to the original sequential code, and consequently they often yield poor performance. In this paper, we give the main reasons for their limitations and show that it is possible in some cases for a TLS system to handle more advanced parallelizing transformations. In particular, we show that codes characterized by phases in which the memory behavior can be modeled by linear functions can take advantage of a dynamic use of the polytope model.
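
    A concrete piece of what any TLS runtime must provide is a dependence test: before committing a speculatively parallelized loop, it has to establish that no element written in one iteration was touched by a different iteration. Below is a simplified shadow-structure test in that spirit, loosely modeled on the LRPD test mentioned in the first abstract above, without its privatization and reduction checks.

```cpp
#include <cstddef>
#include <vector>

// Conservative shadow-structure dependence test: the loop is accepted as
// parallel only if no element is written by one iteration and read or written
// by a different iteration.
class ShadowTest {
    struct Cell {
        long writer = -1;            // iteration of the write seen so far
        long reader = -1;            // iteration of the first read seen so far
        bool readers_differ = false; // reads came from >= 2 distinct iterations
    };

public:
    explicit ShadowTest(std::size_t nelems) : shadow_(nelems) {}

    void on_read(long iter, std::size_t elem) {
        Cell& c = shadow_[elem];
        if (c.writer != -1 && c.writer != iter) conflict_ = true;  // flow dependence
        if (c.reader == -1) c.reader = iter;
        else if (c.reader != iter) c.readers_differ = true;
    }

    void on_write(long iter, std::size_t elem) {
        Cell& c = shadow_[elem];
        if (c.writer != -1 && c.writer != iter) conflict_ = true;  // output dependence
        if (c.reader != -1 && (c.readers_differ || c.reader != iter))
            conflict_ = true;                                      // anti dependence
        c.writer = iter;
    }

    // True if the speculative parallel execution may be committed.
    bool parallelizable() const { return !conflict_; }

private:
    std::vector<Cell> shadow_;
    bool conflict_ = false;
};
```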

    Software Support for Irregular and Loosely Synchronous Problems

    A large class of scientific and engineering applications may be classified as irregular and loosely synchronous from the perspective of parallel processing. We present a partial classification of such problems. This classification has motivated us to enhance Fortran D to provide language support for irregular, loosely synchronous problems. We present techniques for the parallelization of such problems in the context of Fortran D.
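
    The canonical irregular computation in this sense is a loop whose memory accesses go through indirection arrays known only at run time, for example an edge-based update on an unstructured mesh. A small illustrative C++ sketch of the access pattern such language support has to describe (all names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Edge-based update on an unstructured mesh: which entries of `force` each
// iteration touches depends on the indirection arrays `left` and `right`,
// so the communication and partitioning pattern is only known at run time.
void edge_update(const std::vector<std::size_t>& left,
                 const std::vector<std::size_t>& right,
                 const std::vector<double>& edge_weight,
                 const std::vector<double>& node_value,
                 std::vector<double>& force) {
    for (std::size_t e = 0; e < left.size(); ++e) {
        double flux = edge_weight[e] * (node_value[left[e]] - node_value[right[e]]);
        force[left[e]]  += flux;      // scatter through indirection arrays
        force[right[e]] -= flux;
    }
}
```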