1,172 research outputs found

    Speculative parallelization of partially parallel loops

    Get PDF
    Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. In our previous work, we have speculatively executed a loop as a doall, and applied a fully parallel data dependence test to determine if it had any cross–processor depen- dences. If the test failed, then the loop was re–executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops with even one cross- processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. We demonstrate a generalization of the speculative doall parallelization tech- nique, called the Recursive LRPD test, that can extract and exploit the maximum available parallelism of any loop and that limits potential slowdowns to the over- head of the run-time dependence test itself. In this thesis, we have presented the base algorithm and an analysis of the different heuristics for its practical applica- tion. To reduce the run-time overhead of the Recursive LRPD test, we have im- plemented on-demand checkpointing and commit, more efficient data dependence analysis and shadow structures, and feedback-guided load balancing. We obtained scalable speedups for loops from Track, Spice, and FMA3D that were not paralleliz- able by previous speculative parallelization methods

    A Comparative Analysis of STM Approaches to Reduction Operations in Irregular Applications

    Get PDF
    As a recently consolidated paradigm for optimistic concurrency in modern multicore architectures, Transactional Memory (TM) can help to the exploitation of parallelism in irregular applications when data dependence information is not available up to run- time. This paper presents and discusses how to leverage TM to exploit parallelism in an important class of irregular applications, the class that exhibits irregular reduction patterns. In order to test and compare our techniques with other solutions, they were implemented in a software TM system called ReduxSTM, that acts as a proof of concept. Basically, ReduxSTM combines two major ideas: a sequential-equivalent ordering of transaction commits that assures the correct result, and an extension of the underlying TM privatization mechanism to reduce unnecessary overhead due to reduction memory updates as well as unnecesary aborts and rollbacks. A comparative study of STM solutions, including ReduxSTM, and other more classical approaches to the parallelization of reduction operations is presented in terms of time, memory and overhead.Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech

    The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization

    Get PDF
    Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspection-execution and speculative execution, the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism adaptation and mapping challenges. These challenges include tailoring data locality to the memory hierarchy, structuring independent tasks hierarchically to exploit multiple levels of parallelism, tuning the synchronization grain, balancing the execution load, decoupling the execution into thread-level pipelines, and leveraging heterogeneous hardware with specialized accelerators. The polyhedral framework allows to model, construct and apply very complex loop nest transformations addressing most of the parallelism adaptation and mapping challenges. But apart from hardware-specific, back-end oriented transformations (if-conversion, trace scheduling, value prediction), loop nest optimization has essentially ignored dynamic and speculative techniques. Research in polyhedral compilation recently reached a significant milestone towards the support of dynamic, data-dependent control flow. This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations. Selecting real-world examples from SPEC benchmarks and numerical kernels, we make a case for the design of synergistic static, dynamic and speculative loop transformation techniques. We also sketch the embedding of dynamic information, including speculative assumptions, in the heart of affine transformation search spaces

    GPU-TLS: an efficient runtime for speculative loop parallelization on GPUs

    Get PDF
    Recently GPUs have risen as one important parallel platform for general purpose applications, both in HPC and cloud environments. Due to the special execution model, developing programs for GPUs is difficult even with the recent introduction of high-level languages like CUDA and OpenCL. To ease the programming efforts, some research has proposed automatically generating parallel GPU codes by complex compile-time techniques. However, this approach can only parallelize loops 100% free of inter-iteration dependencies (i.e., DOALL loops). To exploit runtime parallelism, which cannot be proven by static analysis, in this work, we propose GPU-TLS, a runtime system to speculatively parallelize possibly-parallel loops in sequential programs on GPUs. GPU-TLS parallelizes a possibly-parallel loop by chopping it into smaller sub-loops, each of which is executed in parallel by a GPU kernel, speculating that no inter-iteration dependencies exist. After dependency checking, the buffered writes of iterations without mis-speculations are copied to the master memory while iterations encountering mis-speculations are re-executed. GPU-TLS addresses several key problems of speculative loop parallelization on GPUs: (1) The larger mis-speculation rate caused by larger number of threads is reduced by three approaches: the loop chopping parallelization approach, the deferred memory update scheme and intra-warp value forwarding method. (2) The larger overhead of dependency checking is reduced by a hybrid scheme: eager intra-warp dependency checking combined with lazy inter-warp dependency checking. (3) The bottleneck of serial commit is alleviated by a parallel commit scheme, which allows different iterations to enter the commit phase out of order but still guarantees sequential semantics. Extensive evaluations using both microbenchmarks and reallife applications on two recent NVIDIA GPU cards show that speculative loop parallelization using GPU-TLS can achieve speedups ranging from 5 to 160 for sequential programs with possibly-parallel loops. © 2013 IEEE.published_or_final_versio

    Factoring out ordered sections to expose thread-level parallelism

    Get PDF
    With the rise of multi-core processors, researchers are taking a new look at extending the applicability auto-parallelization techniques. In this paper, we identify a dependence pattern on which autoparallelization currently fails. This dependence pattern occurs for ordered sections, i.e. code fragments in a loop that must be executed atomically and in original program order. We discuss why these ordered sections prohibit current auto-parallelizers from working and we present a technique to deal with them. We experimentally demonstrate the efficacy of the technique, yielding significant overall program speedups

    Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

    Full text link
    Maximizing parallelism level in applications can be achieved by minimizing overheads due to load imbalances and waiting time due to memory latencies. Compiler optimization is one of the most effective solutions to tackle this problem. The compiler is able to detect the data dependencies in an application and is able to analyze the specific sections of code for parallelization potential. However, all of these techniques provided with a compiler are usually applied at compile time, so they rely on static analysis, which is insufficient for achieving maximum parallelism and producing desired application scalability. One solution to address this challenge is the use of runtime methods. This strategy can be implemented by delaying certain amount of code analysis to be done at runtime. In this research, we improve the parallel application performance generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results of the research were evaluated using an Airfoil application which showed a 40-50% improvement in parallel performance.Comment: 18th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2017
    • …
    corecore