4 research outputs found
The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization
Research in automatic parallelization of loop-centric programs started with
static analysis, then broadened its arsenal to include dynamic
inspection-execution and speculative execution, the best results involving
hybrid static-dynamic schemes. Beyond the detection of parallelism in a
sequential program, scalable parallelization on many-core processors involves
hard and interesting parallelism adaptation and mapping challenges. These
challenges include tailoring data locality to the memory hierarchy, structuring
independent tasks hierarchically to exploit multiple levels of parallelism,
tuning the synchronization grain, balancing the execution load, decoupling the
execution into thread-level pipelines, and leveraging heterogeneous hardware
with specialized accelerators. The polyhedral framework allows to model,
construct and apply very complex loop nest transformations addressing most of
the parallelism adaptation and mapping challenges. But apart from
hardware-specific, back-end oriented transformations (if-conversion, trace
scheduling, value prediction), loop nest optimization has essentially ignored
dynamic and speculative techniques. Research in polyhedral compilation recently
reached a significant milestone towards the support of dynamic, data-dependent
control flow. This opens a large avenue for blending dynamic analyses and
speculative techniques with advanced loop nest optimizations. Selecting
real-world examples from SPEC benchmarks and numerical kernels, we make a case
for the design of synergistic static, dynamic and speculative loop
transformation techniques. We also sketch the embedding of dynamic information,
including speculative assumptions, in the heart of affine transformation search
spaces
Runtime Dependence Computation and Execution of Loops on Heterogeneous Systems
Abstract GPUs have been used for parallel execution of DOALL loops. However, loops with indirect array references can potentially cause cross iteration dependences which are hard to detect using existing compilation techniques. Applications with such loops cannot easily use the GPU and hence do not benefit from the tremendous compute capabilities of GPUs. In this paper, we present an algorithm to compute at runtime the cross iteration dependences in such loops. The algorithm uses both the CPU and the GPU to compute the dependences. Specifically, it effectively uses the compute capabilities of the GPU to quickly collect the memory accesses performed by the iterations by executing the slice functions generated for the indirect array accesses. Using the dependence information, the loop iterations are levelized such that each level contains independent iterations which can be executed in parallel. Another interesting aspect of the proposed solution is that it pipelines the dependence computation of the future level with the actual computation of the current level to effectively utilize the resources available in the GPU. We use NVIDIA Tesla C2070 to evaluate our implementation using benchmarks from Polybench suite and some synthetic benchmarks. Our experiments show that the proposed technique can achieve an average speedup of 6.4x on loops with a reasonable number of cross iteration dependences
Logical Inference Techniques for Loop Parallelization
This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop’s memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = ∅, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially-high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = ∅). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECT-CLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers
Recommended from our members
Dynamic and Dual Streaming Methods for H.264 Video and Parallel Performance Modeling
Traditional approaches to streaming H.264 video over a network typically rely on a single method of transport (i.e., reliable or unreliable) and/or use static values for parameters that can have a significant negative impact on the perceptual quality of the received video. This dissertation presents a dynamic method for wireless channel selection during video streaming, and explores the latency and QoE improvements yielded by the FDSP dual streaming method.
The increased workload that results from these dynamic methods can lead to a counterproductive impairment of streaming performance, and therefore requires efficient use of the multiple cores typically present in both sender and receiver (or server and client). This dissertation therefore presents a performance cost model which can be used to guide the parallelization of specific types of client or server-side streaming components -- specifically, programs containing non-DOALL loops that have inter-iteration data dependences which constrain their parallelism