
    HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC

    This paper presents HALO 1.0, an open-ended, extensible multi-agent software framework that implements a set of proposed hardware-agnostic accelerator orchestration (HALO) principles. HALO implements a novel compute-centric message passing interface (C^2MPI) specification for enabling the performance-portable execution of a hardware-agnostic host application across heterogeneous accelerators. Experimental results from evaluating eight widely used HPC subroutines on Intel Xeon E5-2620 CPUs, Intel Arria 10 GX FPGAs, and NVIDIA GeForce RTX 2080 Ti GPUs show that HALO 1.0 allows a unified control flow for host programs to run across all the computing devices with a consistently top performance portability score, which is up to five orders of magnitude higher than that of the OpenCL-based solution.
    Comment: 21 pages
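
    The core idea is that the host program names the computation rather than the device that runs it. The following is a purely hypothetical sketch of that invocation pattern in Python; the registry, the `compute` function, and the dispatch logic are all invented stand-ins for illustration, not the real C^2MPI specification.

```python
# Toy stand-in for a compute-centric, hardware-agnostic invocation
# layer in the spirit of HALO/C^2MPI.  Everything here is invented
# for illustration; the real C^2MPI specification differs in detail.
import numpy as np

# Registry of "accelerator agents": device name -> implementation of
# each subroutine.  A real orchestrator would dispatch compute
# requests to FPGA/GPU agents over a message-passing interface.
BACKENDS = {
    "cpu": {"matmul": lambda a, b: a @ b},
}

def compute(subroutine, args, device="any"):
    """Issue a compute request; the caller names *what*, not *where*."""
    backend = BACKENDS["cpu"] if device == "any" else BACKENDS[device]
    return backend[subroutine](*args)   # synchronous toy dispatch

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
# Identical host code regardless of which device serves the request.
C = compute("matmul", (A, B), device="any")
print(C.shape)
```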

    A parallel adaptive mesh refinement algorithm

    Over recent years, Adaptive Mesh Refinement (AMR) algorithms, which dynamically match the local resolution of the computational grid to the numerical solution being sought, have emerged as powerful tools for solving problems that contain disparate length and time scales. In particular, several workers have demonstrated the effectiveness of employing an adaptive, block-structured hierarchical grid system for simulations of complex shock wave phenomena. Unfortunately, from the parallel algorithm developer's viewpoint, this class of scheme is quite involved; these schemes cannot be distilled down to a small kernel upon which various parallelizing strategies may be tested. However, because of their block-structured nature, such schemes are inherently parallel, so all is not lost. In this paper we describe the method by which Quirk's AMR algorithm has been parallelized. The method is built upon just a few simple message-passing routines, so it may be implemented across a broad class of MIMD machines. Moreover, the parallelization leaves the original serial code virtually intact, so there is just a single product to support. The importance of this fact should not be underestimated, given the size and complexity of the original algorithm.
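
    Each block only needs boundary ("ghost") data from its neighbours, which is why the message-passing core can stay small. Below is a minimal sketch of one such exchange between neighbouring grid blocks, in Python with mpi4py; the one-block-per-rank ring layout and the ghost width are assumptions for illustration, not details of Quirk's code.

```python
# Minimal ghost-cell exchange between neighbouring grid blocks, one
# block per MPI rank, arranged in a 1-D ring for simplicity.
# Run with e.g.:  mpiexec -n 4 python ghost_exchange.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

ncells, nghost = 64, 2
u = np.zeros(ncells + 2 * nghost)          # interior + ghost cells
u[nghost:-nghost] = rank                   # dummy solution data

# Send our rightmost interior cells to the right neighbour while
# filling our left ghost cells from the left neighbour, and vice versa.
comm.Sendrecv(u[-2 * nghost:-nghost].copy(), dest=right,
              recvbuf=u[:nghost], source=left)
comm.Sendrecv(u[nghost:2 * nghost].copy(), dest=left,
              recvbuf=u[-nghost:], source=right)
```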

    State-of-the-Art in Parallel Computing with R

    R is a mature open-source programming language for statistical computing and graphics. Many areas of statistical research are experiencing rapid growth in the size of data sets, and methodological advances are driving increased use of simulations; a common response to both is parallel computing. This paper presents an overview of techniques for parallel computing with R on computer clusters, on multi-core systems, and in grid computing. It reviews sixteen different packages, comparing them on their state of development, the parallel technology used, usability, acceptance, and performance. Two packages (snow, Rmpi) stand out as particularly useful for general use on computer clusters. Packages for grid computing are still in development, with only one package currently available to the end user. For multi-core systems four different packages exist, but a number of issues pose challenges to early adopters. The paper concludes with ideas for further developments in high performance computing with R. Example code is available in the appendix.
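
    The cluster packages reviewed, such as snow with its parLapply, follow a scatter/compute/gather (master-worker) pattern: the master splits a simulation into independent tasks and collects the results. A minimal sketch of that same pattern is shown below in Python with multiprocessing, purely to illustrate the shape of the workflow; the R APIs differ in detail.

```python
# Scatter/compute/gather: the shape of workflow that snow's
# parLapply provides on a cluster, sketched with a local worker
# pool.  The toy Monte Carlo task is an illustrative stand-in.
from multiprocessing import Pool
import random

def one_replication(seed: int) -> float:
    """One independent simulation replication (a toy Monte Carlo mean)."""
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(10_000)) / 10_000

if __name__ == "__main__":
    with Pool(processes=4) as pool:          # workers ~ cluster nodes
        estimates = pool.map(one_replication, range(100))   # scatter + gather
    print(sum(estimates) / len(estimates))
```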

    Visual object-oriented development of parallel applications

    PhD Thesis. Developing software for parallel architectures is a notoriously difficult task, compounded further by the range of available parallel architectures. There has been little research effort invested in how to engineer parallel applications for more general problem domains than the traditional numerically intensive domain. This thesis addresses these issues. An object-oriented paradigm for the development of general-purpose parallel applications, with full lifecycle support, is proposed and investigated, and a visual programming language to support that paradigm is developed. This thesis presents experiences and results from experiments with this new model for parallel application development. (Funded by the Engineering and Physical Sciences Research Council.)

    Computational Aerodynamics on unstructured meshes

    New 2D and 3D unstructured-grid-based flow solvers have been developed for simulating steady compressible flows in aerodynamic applications. The codes employ the full compressible Euler/Navier-Stokes equations, with the Spalart-Allmaras one-equation turbulence model used to capture turbulence effects. The spatial discretisation uses a cell-centred finite volume scheme on unstructured grids consisting of triangles in 2D and of tetrahedral and prismatic elements in 3D. The temporal discretisation uses an explicit multistage Runge-Kutta scheme. An "inflation" mesh generation technique is introduced to reduce the difficulty of generating highly stretched 2D/3D viscous grids in regions near solid surfaces. The explicit flow method is accelerated by a multigrid method that accounts for the high grid aspect ratios arising in viscous flow simulations. A solution mesh adaptation technique is incorporated to improve the overall accuracy of the 2D inviscid and viscous flow solutions. The 3D flow solvers are parallelised in a MIMD fashion, targeting a PC cluster system, to reduce the computing time for aerodynamic applications.

    The numerical methods are first applied to several 2D inviscid flow cases, including subsonic flow in a bump channel, transonic flow around a NACA0012 airfoil, and transonic flow around the RAE 2822 airfoil, to validate the numerical algorithms. The remaining 2D case studies concentrate on viscous flow simulations, including laminar/turbulent flow over a flat plate, transonic turbulent flow over the RAE 2822 airfoil, and low-speed turbulent flows in a turbine cascade with massive separations. The results are compared to experimental data to assess the accuracy of the method. The over-resolution problem with mesh adaptation in viscous flow simulations is addressed with a two-phase mesh reconstruction procedure. The solution convergence rate with the aspect-ratio-adaptive multigrid method and the direct-connectivity-based multigrid is assessed in several viscous turbulent flow simulations.

    Several 3D test cases are presented to validate the numerical algorithms for solving the Euler/Navier-Stokes equations. Inviscid flow around the M6 wing is simulated with the tetrahedron-based 3D flow solver using an upwind scheme and a second-order finite volume spatial discretisation, and the efficiency of the multigrid for inviscid flow simulations is examined. The efficiency of the parallelised 3D flow solver on the PC cluster system is assessed by simulating the same case with different partitioning schemes; the parallelised 3D flow solvers show satisfactory parallel computing performance. Turbulent flows over a flat plate are simulated with the tetrahedron-based and prism-based flow solvers to validate the viscous term treatment. Next, turbulent flow over the M6 wing is simulated with the parallelised 3D flow solvers to demonstrate the overall accuracy of the algorithms and the efficiency of the multigrid method; the results show very good agreement with experimental data. A highly stretched and well-formed computational grid near the solid wall and wake regions is generated with the "inflation" method, and the aspect-ratio-adaptive multigrid displays a good acceleration rate. Finally, low-speed flow around the NREL Phase II wind turbine is simulated and the results are compared to the experimental data.
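
    The temporal scheme referred to is the standard explicit multistage (Runge-Kutta-type) update for a semi-discrete finite volume system V du/dt = -R(u). Below is a minimal sketch of one such step; the stage coefficients are a common four-stage choice and the toy residual is illustrative, neither is taken from the thesis.

```python
# Explicit multistage (Runge-Kutta-type) time stepping for a
# semi-discrete finite volume scheme  V * du/dt = -R(u).
# Stage coefficients are a common 4-stage choice, for illustration.
import numpy as np

def multistage_step(u, residual, dt, volumes, alphas=(0.25, 1/3, 0.5, 1.0)):
    """Advance cell averages `u` by one time step of size `dt`.

    residual(u) must return the flux residual R(u) per cell;
    `volumes` holds the cell volumes of the mesh.
    """
    u0 = u.copy()
    for alpha in alphas:             # u_k = u_0 - alpha_k * dt/V * R(u_{k-1})
        u = u0 - alpha * (dt / volumes) * residual(u)
    return u

# Toy usage: first-order upwind advection on a periodic 1-D "mesh".
n = 100
x = np.linspace(0.0, 1.0, n, endpoint=False)
u = np.exp(-100 * (x - 0.5) ** 2)
vol = np.full(n, 1.0 / n)
R = lambda u: (u - np.roll(u, 1))    # upwind flux balance per cell
for _ in range(50):
    u = multistage_step(u, R, dt=0.5 / n, volumes=vol)
```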

    Investigating applications portability with the Uintah DAG-based runtime system on PetaScale supercomputers

    Pre-print. Current trends in high performance computing pose formidable challenges for applications code: multicore nodes, possibly with accelerators and/or co-processors, and reduced memory per core, all while scalability must still be attained. Software frameworks that execute machine-independent applications code on top of a runtime system that shields users from architectural complexities offer a possible solution. The Uintah framework, for example, solves a broad class of large-scale problems on structured adaptive grids using fluid-flow solvers coupled with particle-based solids methods. Uintah executes directed acyclic graphs of computational tasks with a scalable, asynchronous, and dynamic runtime system for the CPU cores and/or accelerators/co-processors on a node. Uintah's clear separation between application and runtime code has led to scalability increases of 1000x without significant changes to application code. This methodology is tested on three leading Top500 machines, OLCF Titan, TACC Stampede, and ALCF Mira, using three diverse and challenging application problems. This investigation of scalability with regard to the different processors and communication performance leads to the overall conclusion that the adaptive DAG-based approach provides a very powerful abstraction for solving challenging multi-scale multi-physics engineering problems on some of the largest and most powerful computers available today.
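
    The essence of such a runtime is that a task becomes eligible to run the moment all of its dependencies have completed. A minimal sketch of that idea follows, as a thread-pool DAG executor in Python; it is a toy stand-in for the asynchronous runtime described above, not Uintah's actual scheduler.

```python
# Minimal DAG task executor: launch each task as soon as all of its
# dependencies have finished, using a thread pool.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dag(tasks, deps, workers=4):
    """tasks: name -> callable; deps: name -> set of prerequisite names."""
    remaining = {name: set(d) for name, d in deps.items()}
    done, futures = set(), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(done) < len(tasks):
            # Launch every task whose prerequisites are all satisfied.
            for name in tasks:
                if name not in done and name not in futures \
                        and not remaining[name] - done:
                    futures[name] = pool.submit(tasks[name])
            finished, _ = wait(futures.values(), return_when=FIRST_COMPLETED)
            for name, fut in list(futures.items()):
                if fut in finished:
                    fut.result()            # propagate task errors
                    done.add(name)
                    del futures[name]

# Toy usage: two independent "solver" tasks feeding a coupling task.
run_dag(
    tasks={"fluid": lambda: print("fluid"), "solid": lambda: print("solid"),
           "couple": lambda: print("couple")},
    deps={"fluid": set(), "solid": set(), "couple": {"fluid", "solid"}},
)
```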

    Designing a scalable dynamic load-balancing algorithm for pipelined single program multiple data applications on a non-dedicated heterogeneous network of workstations

    Dynamic load balancing strategies have been shown to be the most critical part of an efficient implementation of various applications on large distributed computing systems. The need for dynamic load balancing strategies increases when the underlying hardware is a non-dedicated heterogeneous network of workstations (HNOW). This research focuses on the single program multiple data (SPMD) programming model, as it has been used extensively in parallel programming for its simplicity and its scalability in terms of computational power and memory size.

    This dissertation formally defines and addresses the problem of designing a scalable dynamic load-balancing algorithm for pipelined SPMD applications on a non-dedicated HNOW. During this process, the HNOW parameters, SPMD application characteristics, and load-balancing performance parameters are identified.

    The dissertation presents a taxonomy that categorizes general load balancing algorithms and a methodology that facilitates creating new algorithms that can harness the HNOW computing power while preserving the scalability of the SPMD application.

    The dissertation devises a new algorithm, DLAH (Dynamic Load-balancing Algorithm for HNOW). DLAH is based on a modified diffusion technique that incorporates the HNOW parameters. An analytical performance bound for the worst-case scenario of the diffusion technique has been derived.

    The dissertation also develops and utilizes an HNOW simulation model to conduct extensive simulations. These simulations were used to validate DLAH and compare its performance to that of related dynamic algorithms. The simulation results show that DLAH is scalable and performs well for both homogeneous and heterogeneous networks. A detailed sensitivity analysis was conducted to study the effects of key parameters on performance.
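
    In the classic diffusion technique that DLAH modifies, each node repeatedly sheds work to (or absorbs work from) its neighbours in proportion to the load difference, so loads relax toward the mean. A small sketch of the plain, homogeneous baseline on a ring of workstations follows; DLAH's HNOW-aware variant would additionally weigh node speed and availability, which is not modeled here.

```python
# Plain diffusion load balancing on a ring: each node exchanges a
# fraction `alpha` of its load difference with each neighbour per
# round.  Unmodified, homogeneous baseline for illustration only.
def diffusion_round(load, alpha=0.25):
    """One synchronous diffusion step over a ring topology."""
    n = len(load)
    new = load[:]
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):   # ring neighbours
            new[i] += alpha * (load[j] - load[i])
    return new

load = [100.0, 0.0, 0.0, 40.0, 0.0, 0.0]       # initial work per node
for _ in range(50):
    load = diffusion_round(load)
print([round(x, 1) for x in load])             # approaches the mean (~23.3)
```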

    RTLabOS Feasibility Studies


    BeeFlow: Behavior Tree-based Serverless Workflow Modeling and Scheduling for Resource-Constrained Edge Clusters

    Serverless computing has gained popularity in edge computing due to its flexible features, including the pay-per-use pricing model, auto-scaling capabilities, and multi-tenancy support. Complex Serverless-based applications typically rely on Serverless workflows (also known as Serverless function orchestration) to express task execution logic, and numerous application- and system-level optimization techniques have been developed for Serverless workflow scheduling. However, there has been limited exploration of optimizing Serverless workflow scheduling in edge computing systems, particularly in high-density, resource-constrained environments such as system-on-chip clusters and single-board-computer clusters. In this work, we find that existing Serverless workflow scheduling techniques typically assume models with limited expressiveness and cause significant resource contention. To address these issues, we propose modeling Serverless workflows using behavior trees, a novel and fundamentally different approach from existing directed-acyclic-graph- and state-machine-based models. Behavior-tree-based modeling allows for easy analysis without compromising workflow expressiveness. We further present observations derived from the inherent tree structure of behavior trees that enable contention-free function collections and awareness of exact and empirical concurrent function invocations. Based on these observations, we introduce BeeFlow, a behavior-tree-based Serverless workflow system tailored for resource-constrained edge clusters. Experimental results demonstrate that BeeFlow achieves up to 3.2x speedup in a high-density, resource-constrained edge testbed and 2.5x speedup in a high-profile cloud testbed, compared with the state of the art.
    Comment: Accepted by the Journal of Systems Architecture
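
    Behavior trees compose tasks with control nodes such as Sequence (run children in order, fail fast) and Fallback/Selector (try children until one succeeds), and it is this fixed, analyzable structure that distinguishes them from DAG and state-machine models. Below is a minimal sketch of that composition applied to workflow-style functions; the node set and the toy workflow are illustrative only and do not reproduce BeeFlow's scheduler.

```python
# Minimal behavior tree for workflow-style tasks: Sequence runs
# children in order and fails fast; Fallback tries children until
# one succeeds.  Illustrative sketch, not BeeFlow's node set.
from typing import Callable, List

class Node:
    def tick(self) -> bool: ...

class Action(Node):
    def __init__(self, fn: Callable[[], bool]):
        self.fn = fn
    def tick(self) -> bool:
        return self.fn()

class Sequence(Node):
    def __init__(self, children: List[Node]):
        self.children = children
    def tick(self) -> bool:
        return all(child.tick() for child in self.children)   # fail fast

class Fallback(Node):
    def __init__(self, children: List[Node]):
        self.children = children
    def tick(self) -> bool:
        return any(child.tick() for child in self.children)   # first success wins

# A workflow that preprocesses, then tries a GPU function and falls
# back to a CPU function when the GPU one fails (e.g. no free device).
workflow = Sequence([
    Action(lambda: print("preprocess") or True),
    Fallback([
        Action(lambda: False),                       # gpu_infer: fails here
        Action(lambda: print("cpu_infer") or True),  # cpu fallback
    ]),
])
print("workflow ok:", workflow.tick())
```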