    Run-time optimization of adaptive irregular applications

    Compared to traditional compile-time optimization, run-time optimization can offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of the input data, the program's dynamic behavior, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on the program's input data and its dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of a program's dynamic memory access patterns with the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework that automatically and adaptively selects at run-time the best-performing, functionally equivalent algorithm for each execution instance. The selection process is based on automatically generated off-line prediction models and on characteristics of the algorithm's input data, collected and analyzed dynamically. In this dissertation, we specialize this framework for the automatic selection of reduction algorithms. We have identified a small set of machine-independent, high-level characterization parameters and deployed a systematic off-line experimental process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly across applications, platforms, and programs' dynamic behaviors. Specifically, for reduction algorithm selection, the performance of the selected algorithm is within 2% of optimal and on average 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
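
    The dissertation's baseline, "Replicated Buffer," privatizes the reduction array per thread and merges the private copies after the parallel loop. A minimal C++/OpenMP sketch of that scheme follows; the histogram-style kernel and the names (hist, index, value) are illustrative, not taken from the dissertation.

        #include <omp.h>
        #include <cstddef>
        #include <vector>

        // Replicated-buffer reduction: each thread accumulates into a private
        // copy of the reduction array; the copies are merged serially afterwards.
        void replicated_buffer_reduction(std::vector<double>& hist,
                                         const std::vector<int>& index,
                                         const std::vector<double>& value) {
            const int nthreads = omp_get_max_threads();
            std::vector<std::vector<double>> priv(
                nthreads, std::vector<double>(hist.size(), 0.0));

            #pragma omp parallel num_threads(nthreads)
            {
                std::vector<double>& local = priv[omp_get_thread_num()];
                #pragma omp for
                for (long i = 0; i < (long)index.size(); ++i)
                    local[index[i]] += value[i];  // irregular, input-dependent access
            }

            // The merge cost grows with the size of the reduction array, which is
            // why a run-time selector may prefer a different algorithm for large,
            // sparsely accessed arrays.
            for (const std::vector<double>& local : priv)
                for (std::size_t j = 0; j < hist.size(); ++j)
                    hist[j] += local[j];
        }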

    A Comparative Analysis of STM Approaches to Reduction Operations in Irregular Applications

    As a recently consolidated paradigm for optimistic concurrency on modern multicore architectures, Transactional Memory (TM) can help exploit parallelism in irregular applications when data dependence information is not available until run-time. This paper presents and discusses how to leverage TM to exploit parallelism in an important class of irregular applications: those that exhibit irregular reduction patterns. In order to test and compare our techniques with other solutions, they were implemented in a software TM system called ReduxSTM, which acts as a proof of concept. Basically, ReduxSTM combines two major ideas: a sequential-equivalent ordering of transaction commits that ensures a correct result, and an extension of the underlying TM privatization mechanism that reduces unnecessary overhead due to reduction memory updates as well as unnecessary aborts and rollbacks. A comparative study of STM solutions, including ReduxSTM, and other more classical approaches to the parallelization of reduction operations is presented in terms of time, memory, and overhead.

    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
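
    The privatization idea, buffering commutative reduction updates inside each transaction as additive deltas so that two transactions updating the same location need not conflict, can be sketched as follows. The names (ReductionTx, reduce_add) are assumptions for illustration, not ReduxSTM's actual API, and the sequential-equivalent ordered commit and the validation of non-reduction accesses are elided.

        #include <mutex>
        #include <unordered_map>

        // Per-transaction private buffer of reduction updates.
        struct ReductionTx {
            std::unordered_map<double*, double> delta;

            // Replaces "*addr += v" inside a transaction: no read of *addr,
            // so no read-set entry and no conflict with other reducers.
            void reduce_add(double* addr, double v) { delta[addr] += v; }
        };

        std::mutex commit_lock;  // stand-in for the ordered commit machinery

        void commit(ReductionTx& tx) {
            // A real system commits transactions in sequential-equivalent order
            // and validates ordinary reads/writes; only the merge step is shown.
            std::lock_guard<std::mutex> g(commit_lock);
            for (auto& [addr, d] : tx.delta) *addr += d;
        }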

    The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization

    Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspector-executor schemes and speculative execution, with the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism adaptation and mapping challenges. These challenges include tailoring data locality to the memory hierarchy, structuring independent tasks hierarchically to exploit multiple levels of parallelism, tuning the synchronization grain, balancing the execution load, decoupling the execution into thread-level pipelines, and leveraging heterogeneous hardware with specialized accelerators. The polyhedral framework makes it possible to model, construct, and apply very complex loop nest transformations that address most of these parallelism adaptation and mapping challenges. Yet apart from hardware-specific, back-end-oriented transformations (if-conversion, trace scheduling, value prediction), loop nest optimization has essentially ignored dynamic and speculative techniques. Research in polyhedral compilation recently reached a significant milestone towards the support of dynamic, data-dependent control flow. This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations. Selecting real-world examples from SPEC benchmarks and numerical kernels, we make a case for the design of synergistic static, dynamic, and speculative loop transformation techniques. We also sketch the embedding of dynamic information, including speculative assumptions, in the heart of affine transformation search spaces.
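
    As a concrete illustration of the kind of loop nest transformation a polyhedral framework can derive automatically, the sketch below tiles a transposition so that the column-wise accesses stay within a cache-sized tile. The example and the tile size T are ours, not drawn from the paper, and N is assumed to be a multiple of T.

        constexpr int N = 1024;  // problem size (illustrative)
        constexpr int T = 64;    // tile size, an assumed tuning parameter

        // After tiling, both A and B are touched in T x T blocks, tailoring
        // data locality to the cache instead of striding across whole rows.
        void transpose_tiled(float A[N][N], const float B[N][N]) {
            for (int ii = 0; ii < N; ii += T)            // inter-tile loops
                for (int jj = 0; jj < N; jj += T)
                    for (int i = ii; i < ii + T; ++i)    // intra-tile loops
                        for (int j = jj; j < jj + T; ++j)
                            A[i][j] = B[j][i];
        }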

    Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

    Maximizing the level of parallelism in applications can be achieved by minimizing the overheads due to load imbalance and the waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to this problem: the compiler can detect data dependencies in an application and analyze specific sections of code for parallelization potential. However, these compiler techniques are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. One solution to this challenge is to use runtime methods, deferring a certain amount of code analysis to runtime. In this research, we improve the performance of parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The techniques were evaluated using an Airfoil application, which showed a 40-50% improvement in parallel performance.

    Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017)
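
    The asynchronous-tasking idea, running loop chunks as futures instead of joining at a barrier after every loop, can be sketched with standard C++ facilities; HPX exposes an analogous hpx::async/hpx::future interface. The chunk parameter stands in for the paper's dynamic chunk sizing, and process_chunk is a placeholder kernel.

        #include <algorithm>
        #include <cstddef>
        #include <future>
        #include <vector>

        void process_chunk(std::vector<double>& data, std::size_t lo, std::size_t hi) {
            for (std::size_t i = lo; i < hi; ++i) data[i] *= 2.0;  // placeholder work
        }

        void async_loop(std::vector<double>& data, std::size_t chunk) {
            std::vector<std::future<void>> tasks;
            for (std::size_t lo = 0; lo < data.size(); lo += chunk) {
                std::size_t hi = std::min(lo + chunk, data.size());
                // Each chunk becomes a task the runtime can interleave with
                // other work, rather than a slice of a barrier-synchronized loop.
                tasks.push_back(std::async(std::launch::async, process_chunk,
                                           std::ref(data), lo, hi));
            }
            for (auto& t : tasks) t.get();  // synchronize only when results are needed
        }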

    Factoring out ordered sections to expose thread-level parallelism

    With the rise of multi-core processors, researchers are taking a new look at extending the applicability of auto-parallelization techniques. In this paper, we identify a dependence pattern on which auto-parallelization currently fails. This dependence pattern occurs for ordered sections, i.e., code fragments in a loop that must be executed atomically and in original program order. We discuss why these ordered sections prevent current auto-parallelizers from working, and we present a technique to deal with them. We experimentally demonstrate the efficacy of the technique, yielding significant overall program speedups.
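
    OpenMP's ordered construct gives a minimal illustration of the pattern: the bulk of each iteration runs in parallel while the marked fragment executes atomically and in original program order. This sketch shows only the pattern the paper targets, not its factoring-out technique; work() and emit() are hypothetical.

        #include <cstdio>

        double work(int i) { return i * 0.5; }  // independent part: runs in parallel

        void emit(int i, double r) {            // must run in program order
            std::printf("%d %f\n", i, r);
        }

        void ordered_loop(int n) {
            #pragma omp parallel for ordered schedule(dynamic)
            for (int i = 0; i < n; ++i) {
                double r = work(i);  // concurrent across iterations
                #pragma omp ordered
                emit(i, r);          // the ordered section serializes here
            }
        }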