4 research outputs found

The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization

Authors: Baghdadi Riyadh, Bastoul Cedric, Cohen Albert, Pouchet Louis-Noel, Rauchwerger Lawrence
Publication date: 01/01/2010

Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspection-execution and speculative execution, with the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism adaptation and mapping challenges. These challenges include tailoring data locality to the memory hierarchy, structuring independent tasks hierarchically to exploit multiple levels of parallelism, tuning the synchronization grain, balancing the execution load, decoupling the execution into thread-level pipelines, and leveraging heterogeneous hardware with specialized accelerators. The polyhedral framework makes it possible to model, construct and apply very complex loop nest transformations addressing most of the parallelism adaptation and mapping challenges. But apart from hardware-specific, back-end oriented transformations (if-conversion, trace scheduling, value prediction), loop nest optimization has essentially ignored dynamic and speculative techniques. Research in polyhedral compilation recently reached a significant milestone towards the support of dynamic, data-dependent control flow. This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations. Selecting real-world examples from SPEC benchmarks and numerical kernels, we make a case for the design of synergistic static, dynamic and speculative loop transformation techniques. We also sketch the embedding of dynamic information, including speculative assumptions, in the heart of affine transformation search spaces.

arXiv.org e-Print Archive

HAL-CentraleSupelec

CiteSeerX

INRIA a CCSD electronic archive server

HAL-Rennes 1

Runtime Dependence Computation and Execution of Loops on Heterogeneous Systems

Authors: Jayvant Anantpur, R Govindarajan
Publication date: 24/04/2020

GPUs have been used for parallel execution of DOALL loops. However, loops with indirect array references can potentially cause cross-iteration dependences, which are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from the tremendous compute capabilities of GPUs. In this paper, we present an algorithm to compute at runtime the cross-iteration dependences in such loops. The algorithm uses both the CPU and the GPU to compute the dependences. Specifically, it effectively uses the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the future level with the actual computation of the current level to effectively utilize the resources available in the GPU. We use an NVIDIA Tesla C2070 to evaluate our implementation using benchmarks from the Polybench suite and some synthetic benchmarks. Our experiments show that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross-iteration dependences.
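
The levelization step described in this abstract can be illustrated with a small sketch. This is not the authors' algorithm: it runs sequentially on the CPU, does no pipelining, and assumes the per-iteration read and write index sets have already been collected. The names levelize and group_by_level are hypothetical.

```python
from collections import defaultdict


def levelize(reads, writes):
    """Give each iteration the earliest level at which it may run.

    reads[i] / writes[i] hold the array indices that iteration i reads /
    writes (assumed to be collected beforehand, e.g. by an inspector pass).
    Iterations sharing a level have no cross-iteration dependences.
    """
    last_write = {}   # index -> level of the latest earlier writer
    last_access = {}  # index -> highest level of any earlier reader or writer
    levels = []
    for r, w in zip(reads, writes):
        level = 0
        for idx in r:  # flow dependence: read after an earlier write
            if idx in last_write:
                level = max(level, last_write[idx] + 1)
        for idx in w:  # anti/output dependence: write after earlier access
            if idx in last_access:
                level = max(level, last_access[idx] + 1)
        levels.append(level)
        for idx in r:
            last_access[idx] = max(last_access.get(idx, -1), level)
        for idx in w:
            last_write[idx] = level
            last_access[idx] = max(last_access.get(idx, -1), level)
    return levels


def group_by_level(levels):
    """Map each level to the list of iterations assigned to it."""
    groups = defaultdict(list)
    for iteration, level in enumerate(levels):
        groups[level].append(iteration)
    return dict(groups)


# Loop modelled: a[w[i]] = a[r[i]] + 1 with w = [0, 1, 2, 3], r = [5, 0, 6, 1]
levels = levelize(reads=[[5], [0], [6], [1]], writes=[[0], [1], [2], [3]])
print(group_by_level(levels))
```

In this example iterations 0 and 2 touch disjoint elements and share the first level, iteration 1 must wait for the value written by iteration 0, and iteration 3 must wait for iteration 1.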

CiteSeerX

Logical Inference Techniques for Loop Parallelization

Author: Cosmin E. Oancea

This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop’s memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = ∅, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = ∅). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECT-CLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers.
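
The idea of testing sufficient-independence conditions in increasing order of estimated cost can be sketched as follows. This is only an illustration, not the USR language or the paper's inference algorithm: the loop, the two predicates, their cost estimates, and the names prove_independent and conditions_for are all assumptions made for the example. The loop modelled is a[off + i] = a[q[i]] + 1 for i in 0..n-1, where every q[i] is assumed to lie in [0, m).

```python
def prove_independent(conditions):
    """Return the name of the cheapest sufficient condition that holds.

    conditions is a list of (name, estimated_cost, predicate) triples.
    Each predicate, when true, implies S = empty set (no written element
    is read by another iteration). None means nothing could be proved,
    so the loop should stay sequential.
    """
    for name, _cost, predicate in sorted(conditions, key=lambda c: c[1]):
        if predicate():
            return name
    return None


def conditions_for(q, off, n, m):
    """Build hypothetical conditions for: a[off + i] = a[q[i]] + 1."""
    written = range(off, off + n)  # integer membership tests are O(1)
    return [
        # O(n) exact test over the run-time contents of q.
        ("exact: no q[i] falls in the written range", len(q),
         lambda: all(v not in written for v in q)),
        # O(1) test that only needs the loop bounds.
        ("cheap: written range lies past every read", 1,
         lambda: off >= m),
    ]


print(prove_independent(conditions_for([0, 1, 7], off=10, n=3, m=8)))
print(prove_independent(conditions_for([0, 1, 7], off=3, n=3, m=8)))
print(prove_independent(conditions_for([0, 4, 7], off=3, n=3, m=8)))
```

The first call is settled by the constant-time test, the second needs the linear scan over q, and the third finds a real overlap (q[1] = 4 is written by another iteration), so no condition holds.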

CiteSeerX

Dynamic and Dual Streaming Methods for H.264 Video and Parallel Performance Modeling

Author: Johnson Taylor N.
Publication venue: Oregon State University

Traditional approaches to streaming H.264 video over a network typically rely on a single method of transport (i.e., reliable or unreliable) and/or use static values for parameters that can have a significant negative impact on the perceptual quality of the received video. This dissertation presents a dynamic method for wireless channel selection during video streaming, and explores the latency and QoE improvements yielded by the FDSP dual streaming method. The increased workload that results from these dynamic methods can lead to a counterproductive impairment of streaming performance, and therefore requires efficient use of the multiple cores typically present in both sender and receiver (or server and client). This dissertation therefore presents a performance cost model which can be used to guide the parallelization of specific types of client or server-side streaming components: specifically, programs containing non-DOALL loops that have inter-iteration data dependences which constrain their parallelism.

ScholarsArchive@OSU