77 research outputs found

    Optimizing iterative data-flow scientific applications using directed cyclic graphs

    Get PDF
    Data-flow programming models have become a popular choice for writing parallel applications as an alternative to traditional work-sharing parallelism. They are better suited to write applications with irregular parallelism that can present load imbalance. However, these programming models suffer from overheads related to task creation, scheduling and dependency management, limiting performance and scalability when tasks become too small. At the same time, many HPC applications implement iterative methods or multi-step simulations that create the same directed acyclic graphs of tasks on each iteration. By giving application programmers a way to express that a specific loop is creating the same task pattern on each iteration, we can create a single task directed acyclic graph (DAG) once and transform it into a cyclic graph. This cyclic graph is then reused for successive iterations, minimizing task creation and dependency management overhead. This paper presents the taskiter, a new construct we propose for the OmpSs-2 and OpenMP programming models, allowing the use of directed cyclic task graphs (DCTG) to minimize runtime overheads. Moreover, we present a simple immediate successor locality-aware heuristic that minimizes task scheduling overhead by bypassing the runtime task scheduler. We evaluate the implementation of the taskiter and the immediate successor heuristic in 8 iterative benchmarks. Using small task granularities, we obtain a geometric mean speedup of 2.56x over the reference OmpSs-2 implementation, and a 3.77x and 5.2x speedup over the LLVM and GCC OpenMP runtimes, respectively.This work was supported in part by the European Union’s Horizon 2020/EuroHPC Research and Innovation Programme (DEEP-SEA) under Grant 955606; in part by the Spanish State Research Agency—Ministry of Science and Innovation, Generalitat de Catalunya, under Project PCI2021121958 and Project 2021-SGR-01007; in part by the Spanish Ministry of Science and Technology under Contract PID2019-107255GB; and in part by Severo Ochoa under Grant CEX2021-001148-S/MCIN/AEI/10.13039/501100011033.Peer ReviewedPostprint (published version

    Saiph, a domain specific language for computational fluid dynamics simulations

    Get PDF
    Nowadays, High-Performance Computing (HPC) is assuming an increasingly central role in scientific research while computer architectures are becoming more and more hetero-geneous and using different parallel programming models and techniques. Under this scenario, the only way to successfully exploit an HPC system requires that computer and domain scientists work closely towards producing applications to solve domain problems, ensuring productivity and performance at the same time. Facing such purpose, Saiph is a Domain Specific Language designed to ease the task of solving couple and uncouple Partial Differential Equations (PDE’s), with a primary focusing on Computational Fluid Dynamics (CFD) applications. Saiph allows to model complex physical phenomena featured by PDE’s, easing the use of numerical methods and optimizations on different computer architectures to the users

    Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques

    Get PDF
    In this paper we describe the design of fault tolerance capabilities for general-purpose offload semantics, based on the OmpSs programming model. Using ParaStation MPI, a production MPI-3.1 implementation, we explore the features that, being standard compliant, an MPI stack must support to provide the necessary fault tolerance guarantees, based on MPI's dynamic process management. Our results, including synthetic benchmarks and applications, reveal low runtime overhead and efficient recovery, demonstrating that the existing MPI standard provided us with sufficient mechanisms to implement an effective and efficient fault-tolerant solution.This research received funding from the European Community’s 7th Framework Programme via the DEEP-ER project under Grant Agreement no. 610476. This work has also been supported by the Spanish Ministry of Science and Innovation (contract TIN2012-34557) and by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. The authors thank Jorge BellÂŽon, from BSC, for his technical support with the Nanos++ internals.Peer ReviewedPostprint (author's final draft

    A native tensor-vector multiplication algorithm for high performance computing

    Get PDF
    Tensor computations are important mathematical operations for applications that rely on multidimensional data. The tensor-vector multiplication (TVM) is the most memory-bound tensor contraction in this class of operations. This paper proposes an open-source TVM algorithm which is much simpler and efficient than previous approaches, making it suitable for integration in the most popular BLAS libraries available today. Our algorithm has been written from scratch and features unit-stride memory accesses, cache awareness, mode obliviousness, full vectorization and multi-threading as well as NUMA awareness for non-hierarchically stored dense tensors. Numerical experiments are carried out on tensors up to order 10 and various compilers and hardware architectures equipped with traditional DDR and high bandwidth memory (HBM). For large tensors the average performance of the TVM ranges between 62% and 76% of the theoretical bandwidth for NUMA systems with DDR memory and remains independent of the contraction mode. On NUMA systems with HBM the TVM exhibits some mode dependency but manages to reach performance figures close to peak values. Finally, the higher-order power method is benchmarked with the proposed TVM kernel and delivers on average between 58% and 69% of the theoretical bandwidth for large tensors.This work was supported in part by MCIN/AEI and ESF under Grant RYC2019-027592-I, and in part by the HPC Technology Innovation Lab, a Barcelona Supercomputing Center and Huawei research cooperation agreement (2020).Peer ReviewedPostprint (author's final draft

    A toolchain to verify the parallelization of OmpSs-2 applications

    Get PDF
    Programming models for task-based parallelization based on compile-time directives are very effective at uncovering the parallelism available in HPC applications. Despite that, the process of correctly annotating complex applications is error-prone and may hinder the general adoption of these models. In this paper, we target the OmpSs-2 programming model and present a novel toolchain able to detect parallelization errors coming from non-compliant OmpSs-2 applications. Our toolchain verifies the compliance with the OmpSs-2 programming model using local task analysis to deal with each task separately, and structural induction to extend the analysis to the whole program. To improve the effectiveness of our tools, we also introduce some ad-hoc verification annotations, which can be used manually or automatically to disable the analysis of specific code regions. Experiments run on a sample of representative kernels and applications show that our toolchain can be successfully used to verify the parallelization of complex real-world applications.This project is supported by the European Union’s Horizon 2021 research and innovation programme under grant agreement No 754304 (DEEP-EST), by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 871669 (AMPERE) and the Project HPCEUROPA3 (INFRAIA-2016-1-730897), by the Ministry of Economy of Spain through the Severo Ochoa Center of Excellence Program (SEV-2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by the Generalitat de Catalunya (2017-SGR-1481).Peer ReviewedPostprint (author's final draft

    Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs

    Get PDF
    In this paper, we target the parallel solution of sparse linear systems via iterative Krylov subspace-based method enhanced with a block-Jacobi preconditioner on a cluster of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the preconditioned conjugate gradient method that improve the interoperability between the message-passing interface and OmpSs programming models. Specifically, we progressively integrate several communication-reduction and iteration-fusing strategies into the initial code, obtaining more efficient versions of the method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 32 nodes with 24 cores each. The experimental analysis shows that the techniques described in the paper outperform the classical method by a margin that varies between 6 and 48%, depending on the evaluation

    CellMT: A cooperative multithreading library for the Cell/B.E.

    Get PDF
    The Cell BE processor has proved that heterogeneous multi-core systems can provide a huge computational power with high efficiency for a wide range of applications. The simple design of the computational units and the use of small managed local memories is the key to achieve high efficiency and performance at the same time. However, this simple and efficient hardware design comes at the price of higher code complexity. The code written to run in this kind of processors must deal with several issues such as code vectorization, loop unrolling or the explicit management of local memories. Some of these issues such as vectorization or loop unrolling can be partially solved by the compiler, but the overlapping of data transfer and computation times must be manually addressed by the programmer with techniques such as double buffering that increase the code complexity. In this paper we present a user level threading library called CellMT that effectively hide memory latencies. The concurrent execution of several threads inside each SPU naturally overlaps computation and data transfer times without increasing the code complexity. To prove the suitability and feasibility of our multi-threaded library, we perform an exhaustive performance evaluation with a synthetic benchmark and a real application. The experimental results show that the multithreaded approach can outperform a hand-coded double buffering scheme, with speedups from 0.96x to 3.2x, while maintaining the complexity of a naive buffering scheme.Peer ReviewedPostprint (published version

    Adaptive and architecture-independent task granularity for recursive applications

    Get PDF
    In the last few decades, modern applications have become larger and more complex. Among the users of these applications, the need to simplify the process of identifying units of work increased as well. With the approach of tasking models, this want has been satisfied. These models make scheduling units of work much more user-friendly. However, with the arrival of tasking models, came granularity management. Discovering an application’s optimal granularity is a frequent and sometimes challenging task for a wide range of recursive algorithms. Often, finding the optimal granularity will cause a substantial increase in performance. With that in mind, the quest for optimality is no easy task. Many aspects have to be considered that are directly related to lack or excess of parallelism in applications. There is no general solution as the optimal granularity depends on both algorithm and system characteristics. One commonly used method to find an optimal granularity consists in experimentally tuning an application with different granularities until an optimal is found. This paper proposes several heuristics which, combined with the appropriate monitoring techniques, allow a runtime system to automatically tune the granularity of recursive applications. The solution is independent of the architecture, execution environment or application being tested. A reference implementation in OmpSs—a task-parallel programming model—shows the programmability, ease of use and competitive performance of the proposed solution. Results show that the proposed solution is able to achieve, for any scenario, at least 75% of the performance of optimally tuned applications.This work has been supported by the Spanish Ministry of Science and Innovation (contract TIN2015-65316), the grant SEV-2015-0493 of Severo Ochoa Program awarded by the Spanish Government, and by Generalitat de Catalunya (contract 2014-SGR-1051)Peer ReviewedPostprint (author's final draft

    Efficient Scalable Computing through Flexible Applications and Adaptive Workloads

    Get PDF
    In this paper we introduce a methodology for dynamic job reconfiguration driven by the programming model runtime in collaboration with the global resource manager. We improve the system throughput by exploiting malleability techniques (in terms of number of MPI ranks) through the reallocation of resources assigned to a job during its execution. In our proposal, the OmpSs runtime reconfigures the number of MPI ranks during the execution of an application in cooperation with the Slurm workload manager. In addition, we take advantage of OmpSs offload semantics to allow application developers deal with data redistribution. By combining these elements a job is able to expand itself in order to exploit idle nodes or be shrunk if other queued jobs could be initiated. This novel approach adapts the system workload in order to increase the throughput as well as make a smarter use of the underlying resources. Our experiments demonstrate that this approach can reduce the total execution time of a practical workload by more than 40% while reducing the amount of resources by 30%.This work is supported by the Project TIN2014-53495-R and TIN2015-65316-P from MINECO and FEDER. Antonio J. Peña is cofinanced by MINECO under Juan de la Cierva fellowship number IJCI-2015-23266. Special thanks to José I. Aliaga for the conjugate gradient code.Peer ReviewedPostprint (author's final draft
    • 

    corecore