
    Source-to-Source Transformations for Parallel Optimizations in STAPL

    Programs that use the STAPL C++ parallel programming library express their control and data flow explicitly through the use of skeletons. Skeletons can be simple parallel operations like map and reduce, or the result of composing several skeletons. Composition is implemented by tracking the dependencies among individual data elements in the STAPL runtime system. However, the operations and dependencies within a compose skeleton can be determined at compile time from the C++ abstract syntax tree. This enables the use of source-to-source transformations to fuse the composed skeletons. Transformations can also be used to replace skeletons entirely with equivalent code. Both transformations greatly reduce STAPL runtime overhead, and zip fusion also allows a compiler to optimize the work functions as a single unit. We present a Clang compiler plugin and wrapper that automatically perform these transformations, and demonstrate their ability to improve performance.
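
    To make the fusion idea concrete, here is a minimal sketch in plain C++ of what map-reduce fusion achieves. The names and structure are illustrative only and deliberately avoid STAPL's actual skeleton API.

        #include <vector>
        #include <algorithm>
        #include <numeric>

        // Composed form: two skeleton applications, each a full pass,
        // with the intermediate results materialized between them.
        double composed(const std::vector<double>& in) {
            std::vector<double> tmp(in.size());
            std::transform(in.begin(), in.end(), tmp.begin(),
                           [](double x) { return x * x; });        // map
            return std::accumulate(tmp.begin(), tmp.end(), 0.0);   // reduce
        }

        // Fused form a source-to-source pass could emit: one traversal,
        // no intermediate container, and both work functions visible to
        // the compiler as a single unit.
        double fused(const std::vector<double>& in) {
            double acc = 0.0;
            for (double x : in)
                acc += x * x;  // map and reduce bodies merged
            return acc;
        }

    The fused form eliminates the intermediate container and the per-element dependency tracking, which is the runtime overhead the paper's transformations target.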

    Automatic translation of non-repetitive OpenMP to MPI

    Cluster platforms with distributed-memory architectures have become widely available, low-cost solutions for high-performance computing. A productive programming environment that hides the complexity of clusters while still allowing efficient programs is urgently needed. Despite multiple efforts to provide a shared-memory abstraction, message passing (MPI) remains the state-of-the-art programming model for distributed-memory architectures. Writing efficient MPI programs is challenging. In contrast, OpenMP is a shared-memory programming model known for its productivity. Researchers have introduced automatic source-to-source translation schemes from OpenMP to MPI so that programmers can use OpenMP while targeting clusters. Those schemes limited their focus to OpenMP programs with repetitive communication patterns, where the analysis of communication can be simplified. This dissertation relaxes this limitation and presents a novel OpenMP-to-MPI translation scheme that covers OpenMP programs with both repetitive and non-repetitive communication patterns. We target laboratory-size clusters of ten to a hundred nodes, commonly found in research laboratories and small enterprises. With our translation scheme, six non-repetitive and four repetitive OpenMP benchmarks were efficiently scaled to a cluster of 64 cores; by contrast, the state-of-the-art translator scaled only the four repetitive benchmarks. Our translation scheme was also shown to match or outperform the state-of-the-art translator. We further compare it with available hand-coded MPI and Unified Parallel C (UPC) programs.
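
    As a rough illustration of the kind of rewrite such a translator performs (not the dissertation's actual scheme, which is more general), consider how a simple OpenMP reduction loop maps onto SPMD-style MPI code:

        #include <mpi.h>

        // OpenMP source: iterations split among threads.
        //   #pragma omp parallel for reduction(+:sum)
        //   for (int i = 0; i < n; ++i) sum += a[i] * b[i];

        double dot_mpi(const double* a, const double* b, int n) {
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            // Block-distribute the iteration space across MPI ranks,
            // mirroring OpenMP's static loop schedule.
            int chunk = (n + size - 1) / size;
            int lo = rank * chunk;
            int hi = (lo + chunk < n) ? lo + chunk : n;

            double local = 0.0;
            for (int i = lo; i < hi; ++i)
                local += a[i] * b[i];

            // The reduction clause becomes an explicit collective.
            double sum = 0.0;
            MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
            return sum;
        }

    The hard part the dissertation addresses is not this regular case but programs whose communication partners change between iterations, where the message pattern cannot be computed once and reused.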

    Doctor of Philosophy

    High Performance Computing (HPC) on-node parallelism is of extreme importance for guaranteeing and maintaining scalability across large clusters of hundreds of thousands of multicore nodes. HPC programming is dominated by the hybrid model "MPI + X", with MPI exploiting parallelism across nodes and "X" being some shared-memory parallel programming model that handles multicore parallelism across CPUs or GPUs. OpenMP has become the de facto "X" standard in HPC for exploiting the multicore architectures of modern CPUs. Data races are among the most common and insidious concurrency errors in shared-memory programming models, and OpenMP programs are not immune to them. The ease with which OpenMP parallelizes programs can also make them prone to data races, which become hard to find in large applications with thousands of lines of code. Unfortunately, prior tools have been unable to impact practice owing to their poor coverage or poor scalability. In this work, we develop several new approaches for low-overhead data race detection. Our approaches aim to guarantee high precision and accuracy of race checking while maintaining low runtime and memory overhead. We present two race checkers for C/C++ OpenMP programs that target two different classes of programs. The first, ARCHER, is fast but requires a large amount of memory, so it ideally targets applications that need only a small portion of the available on-node memory. SWORD, on the other hand, combines fast, zero-memory-overhead data collection with an offline analysis that can take a long time, though it often reports most races quickly. Given that race checking was previously impossible for large OpenMP applications, our contributions are the best available advances in what is known to be a difficult NP-complete problem. We performed an extensive evaluation of the tools on existing OpenMP programs and HPC benchmarks. Results show that both tools guarantee identification of all the races of a program in a given run without reporting any false alarms. The tools are user-friendly and hence serve as an important instrument in the daily work of programmers, helping them identify data races early during development and production testing. Furthermore, our demonstrated success on real-world applications puts these tools on the top list of debugging tools for scientists at large.
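
    For readers unfamiliar with the bug class, the following is a minimal example of the kind of OpenMP data race such checkers flag:

        #include <cstdio>

        int main() {
            int sum = 0;
            // Racy: every thread performs an unsynchronized
            // read-modify-write on `sum`. The fix is to add
            // reduction(+:sum) to the pragma.
            #pragma omp parallel for
            for (int i = 0; i < 1000; ++i)
                sum += i;
            std::printf("%d\n", sum);  // nondeterministic without the fix
            return 0;
        }

    Races like this one are easy to spot in ten lines but, as the abstract notes, become extremely hard to find when buried in applications with thousands of lines of code.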

    The Cost and Benefits of Coordination Programming: Two Case Studies in Concurrent Collection and S-Net

    Electronic version of an article published as Pavel Zaichenkov et al., Parallel Processing Letters, Vol. 26 (3), 2016, 24 pages. DOI: http://www.worldscientific.com/doi/abs/10.1142/S0129626416500110 © 2016 World Scientific Publishing Company, http://www.worldscientific.com/worldscinet/ppl
    This is an evaluation study of the expressiveness provided and the performance delivered by the coordination language S-Net in comparison with Intel's Concurrent Collections (CnC). An S-Net application is a network of black-box compute components connected through anonymous data streams, with the standard input and output streams linking the application to the environment. Our case study is based on two applications: a face-detection algorithm implemented as a pipeline of feature classifiers, and a numerical algorithm from the linear-algebra domain, namely Cholesky decomposition. The selected applications are representative and have been used by Intel researchers as evaluation testbeds for CnC in the past. We implement various versions of both algorithms in S-Net and compare them with equivalent CnC implementations, both with and without tuning, previously published by the CnC community. Our experiments on a large-scale server system demonstrate that S-Net delivers scalability and absolute performance on the studied examples very similar to those of tuned CnC codes, even without specific tuning. At the same time, S-Net achieves a much more complete separation of concerns between the compute and coordination layers than CnC even intends to.
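
    As a toy sketch only (S-Net expresses this declaratively and runs the stages in parallel; this sequential C++ fragment merely shows the shape of a stream-connected cascade of black-box stages, as in the face-detection pipeline):

        #include <functional>
        #include <queue>
        #include <vector>

        using Record = int;                           // placeholder payload
        using Stage  = std::function<bool(Record&)>;  // classifier: pass/reject

        // A pipeline is a composition of opaque stages over a stream of
        // records; the coordination layer knows nothing about their insides.
        std::vector<Record> run_pipeline(std::queue<Record> input,
                                         const std::vector<Stage>& stages) {
            std::vector<Record> accepted;
            while (!input.empty()) {
                Record r = input.front();
                input.pop();
                bool pass = true;
                for (const auto& s : stages)       // each stage is a black box
                    if (!(pass = s(r))) break;     // early reject, as in cascades
                if (pass) accepted.push_back(r);
            }
            return accepted;
        }

    The separation-of-concerns claim in the abstract is precisely that the stage implementations and this coordination structure can be developed and changed independently.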

    OpenMP to CUDA graphs: a compiler-based transformation to enhance the programmability of NVIDIA devices

    Heterogeneous computing is increasingly used in a diversity of computing systems, ranging from HPC to the real-time embedded domain, to cope with performance requirements. Given the variety of accelerators (e.g., FPGAs and GPUs), high-level parallel programming models are desirable to exploit their performance capabilities while maintaining an adequate level of productivity. In that regard, OpenMP is a well-known high-level programming model that incorporates powerful task and accelerator models capable of efficiently exploiting structured and unstructured parallelism in heterogeneous computing. This paper presents a novel compiler transformation technique that automatically transforms OpenMP code into CUDA graphs, combining the programmability benefits of a high-level programming model such as OpenMP with the performance benefits of a low-level programming model such as CUDA. Evaluations have been performed on two NVIDIA GPUs from the HPC and embedded domains, i.e., the V100 and the Jetson AGX, respectively. This work has been supported by the EU H2020 project AMPERE under grant agreement no. 871669.
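
    The abstract does not show the generated code; a plausible sketch of the transformation's target, using the CUDA runtime's stream-capture API to record dependent kernels as a graph and replay it, looks like this (the kernels and the CUDA 12+ cudaGraphInstantiate signature are assumptions for illustration):

        #include <cuda_runtime.h>

        __global__ void stepA(float* d) { d[threadIdx.x] += 1.0f; }
        __global__ void stepB(float* d) { d[threadIdx.x] *= 2.0f; }

        // Two dependent OpenMP tasks (stepA -> stepB) become a CUDA graph
        // that is captured once and relaunched cheaply each iteration.
        // `d` is a device pointer; assumes n <= 1024.
        void run_as_graph(float* d, int n, int iters) {
            cudaStream_t s;
            cudaStreamCreate(&s);

            cudaGraph_t graph;
            cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
            stepA<<<1, n, 0, s>>>(d);  // the stepA -> stepB edge is
            stepB<<<1, n, 0, s>>>(d);  // recorded by the capture
            cudaStreamEndCapture(s, &graph);

            cudaGraphExec_t exec;
            cudaGraphInstantiate(&exec, graph, 0);  // CUDA 12+ signature

            // One launch call replays the whole task graph, amortizing
            // per-kernel launch overhead across iterations.
            for (int i = 0; i < iters; ++i)
                cudaGraphLaunch(exec, s);
            cudaStreamSynchronize(s);

            cudaGraphExecDestroy(exec);
            cudaGraphDestroy(graph);
            cudaStreamDestroy(s);
        }

    Replaying an instantiated graph avoids re-submitting each kernel and re-deriving its dependencies on every iteration, which is the main performance benefit CUDA graphs offer over plain stream launches.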