15 research outputs found

    A Team-Based Methodology of Memory Hierarchy-Aware Runtime Support in Coarray Fortran

    No full text
    International audience—We describe how 2-level memory hierarchies can be exploited to optimize the implementation of teams in the parallel facet of the upcoming Fortran 2015 standard. We focus on reducing the cost associated with moving data within a computing node and between nodes, finding that this distinction is of key importance when looking at performance issues. We introduce a new hardware-aware approach for PGAS, to be used within a runtime system, to optimize the communications in the virtual topologies and clusters that are binding different teams together. We have applied, and implemented into the CAF OpenUH compiler, this methodology to three important collective operations, namely barrier, all-to-all reduction and one-to-all broadcast; this is the first Fortran compiler that both provides teams and handles such a memory hierarchy methodology within teams I. INTRODUCTION The emergence of many-core computing nodes in large-scale distributed systems requires programming model im-plementers to consider more carefully memory hierarchy when looking at performance issues. Most parallel applications are programmed using the Message Passing Interface (MPI) [1], where multiple processes execute in a coordinated manner, communicating by performing send and receive operations. More recently, several languages and libraries have added support for explicit or implicit remote memory access (RMA) using so-called " one-sided communication " , including languages following the Partitioned Global Address Space (PGAS) paradigm as well as MPI (MPI-2 added RMA to the interface and MPI-3 made significant refinements to better support it). Of special note is the Fortran 2008 addition for supporting coarrays, a language mechanism that enables RMA as a natural extension to Fortran's array syntax, informally named CAF 1. In this paradigm, an image is an executing process in a Single Program Multiple Data (SPMD) program with its own cop

    A framework for unit testing with coarray Fortran

    Get PDF
    Parallelism is a ubiquitous feature of modern computing architectures; indeed, we might even say that serial code is now automatically legacy code. Writing parallel code poses significant challenges to programs, and is often error-prone. Partitioned Global Address Space (PGAS) languages, such as Coarray Fortran (CAF), represent a promising development direction in the quest for a trade-off between simplicity and performance. CAF is a parallel programming model that allows a smooth migration from serial to parallel code. However, despite CAF simplicity, refactoring serial code and migrating it to parallel versions is still error-prone, especially in complex softwares. The combination of unit testing, which drastically reduces defect injection, and CAF is therefore a very appealing prospect; however, it requires appropriate tools to realize its potential. In this paper, we present the first CAF-compatible framework for unit tests, developed as an extension to the Parallel Fortran Unit Test framework (pFUnit)

    Proceedings of the 7th International Conference on PGAS Programming Models

    Get PDF

    Survey of Novel Programming Models for Parallelizing Applications at Exascale

    Full text link

    Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management

    Get PDF
    Advancement in cutting edge technologies have enabled better energy efficiency as well as scaling computational power for the latest High Performance Computing(HPC) systems. However, complexity, due to hybrid architectures as well as emerging classes of applications, have shown poor computational scalability using conventional execution models. Thus alternative means of computation, that addresses the bottlenecks in computation, is warranted. More precisely, dynamic adaptive resource management feature, both from systems as well as application\u27s perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions, targeted at one or more synchronous domains, meta data for computation and resource management as well as infrastructure for dynamic policy deployment. In addition to this, the research presents additional guidelines for a framework for resource management in HPX runtime system. Further, this research also lists design principles for scalability of Active Global Address Space (AGAS), a necessary feature for Parallel Processes. Also, to verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task scheduling policies is carried out using two different applications. The applications used are: Unbalanced Tree Search, a reference dynamic graph application, implemented by this research in HPX and MiniGhost, a reference stencil based application using bulk synchronous parallel model. The results show that different scheduling policies provide better performance for different classes of applications; and for the same application class, in certain instances, one policy fared better than the others, while vice versa in other instances, hence supporting the hypothesis of the need of dynamic adaptive resource management infrastructure, for deploying different policies and task granularities, for scalable distributed computing

    Data layout types : a type-based approach to automatic data layout transformations for improved SIMD vectorisation

    Get PDF
    The increasing complexity of modern hardware requires sophisticated programming techniques for programs to run efficiently. At the same time, increased power of modern hardware enables more advanced analyses to be included in compilers. This thesis focuses on one particular optimisation technique that improves utilisation of vector units. The foundation of this technique is the ability to chose memory mappings for data structures of a given program. Usually programming languages use a fixed layout for logical data structures in physical memory. Such a static mapping often has a negative effect on usability of vector units. In this thesis we consider a compiler for a programming language that allows every data structure in a program to have its own data layout. We make sure that data layouts across the program are sound, and most importantly we solve a problem of automatic data layout reconstruction. To consistently do this, we formulate this as a type inference problem, where type encodes a data layout for a given structure as well as implied program transformations. We prove that type-implied transformations preserve semantics of the original programs and we demonstrate significant performance improvements when targeting SIMD-capable architectures

    UPCBLAS : a numerical library for unified parallel C with architecture-aware optimizations

    Get PDF
    [Abstract] The popularity of Partitioned Global Address Space (PGAS) languages has increased during the last years thanks to their high programmability and performance through an efficient exploitation of data locality, especially on hierarchical architectures like multicore clusters. This PhD Thesis describes UPCBLAS, a parallel library for numerical computation using the PGAS Unified Parallel C (UPC) language. The routines are built on top of sequential BLAS and SparseBLAS functions and exploit the particularities of the PGAS paradigm, taking into account data locality in order to achieve a good performance. However, the growing complexity in computer system hierarchies due to the increase in the number of cores per processor, levels of cache (some of them shared) and the number of processors per node, as well as the high-speed interconnects, demands the use of new optimization techniques and libraries that take advantage of their features. For this reason, this Thesis also presents Servet, a suite of benchmarks focused on detecting a set of parameters with high in uence on the overall performance of multicore systems. UPCBLAS routines use the hardware parameters provided by Servet to implement optimization techniques that improve their performance. The performance of the library has been experimentally evaluated on several multicore supercomputers and compared to message-passing-based parallel numerical libraries, demonstrating good scalability and efficiency. UPCBLAS has also been used to develop more complex numerical codes in order to demonstrate that it is a good alternative to MPI-based libraries for increasing the productivity of numerical application developers

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest

    STAPL-RTS: A Runtime System for Massive Parallelism

    Get PDF
    Modern High Performance Computing (HPC) systems are complex, with deep memory hierarchies and increasing use of computational heterogeneity via accelerators. When developing applications for these platforms, programmers are faced with two bad choices. On one hand, they can explicitly manage machine resources, writing programs using low level primitives from multiple APIs (e.g., MPI+OpenMP), creating efficient but rigid, difficult to extend, and non-portable implementations. Alternatively, users can adopt higher level programming environments, often at the cost of lost performance. Our approach is to maintain the high level nature of the application without sacrificing performance by relying on the transfer of high level, application semantic knowledge between layers of the software stack at an appropriate level of abstraction and performing optimizations on a per-layer basis. In this dissertation, we present the STAPL Runtime System (STAPL-RTS), a runtime system built for portable performance, suitable for massively parallel machines. While the STAPL-RTS abstracts and virtualizes the underlying platform for portability, it uses information from the upper layers to perform the appropriate low level optimizations that restore the performance characteristics. We outline the fundamental ideas behind the design of the STAPL-RTS, such as the always distributed communication model and its asynchronous operations. Through appropriate code examples and benchmarks, we prove that high level information allows applications written on top of the STAPL-RTS to attain the performance of optimized, but ad hoc solutions. Using the STAPL library, we demonstrate how this information guides important decisions in the STAPL-RTS, such as multi-protocol communication coordination and request aggregation using established C++ programming idioms. Recognizing that nested parallelism is of increasing interest for both expressivity and performance, we present a parallel model that combines asynchronous, one-sided operations with isolated nested parallel sections. Previous approaches to nested parallelism targeted either static applications through the use of blocking, isolated sections, or dynamic applications by using asynchronous mechanisms (i.e., recursive task spawning) which come at the expense of isolation. We combine the flexibility of dynamic task creation with the isolation guarantees of the static models by allowing the creation of asynchronous, one-sided nested parallel sections that work in tandem with the more traditional, synchronous, collective nested parallelism. This allows selective, run-time customizable use of parallelism in an application, based on the input and the algorithm

    Design and implementation of an array language for computational science on a heterogeneous multicore architecture

    Get PDF
    The packing of multiple processor cores onto a single chip has become a mainstream solution to fundamental physical issues relating to the microscopic scales employed in the manufacture of semiconductor components. Multicore architectures provide lower clock speeds per core, while aggregate floating-point capability continues to increase. Heterogeneous multicore chips, such as the Cell Broadband Engine (CBE) and modern graphics chips, also address the related issue of an increasing mismatch between high processor speeds, and huge latency to main memory. Such chips tackle this memory wall by the provision of addressable caches; increased bandwidth to main memory; and fast thread context switching. An associated cost is often reduced functionality of the individual accelerator cores; and the increased complexity involved in their programming. This dissertation investigates the application of a programming language supporting the first-class use of arrays; and capable of automatically parallelising array expressions; to the heterogeneous multicore domain of the CBE, as found in the Sony PlayStation 3 (PS3). The language is a pre-existing and well-documented proper subset of Fortran, known as the ‘F’ programming language. A bespoke compiler, referred to as E , is developed to support this aim, and written in the Haskell programming language. The output of the compiler is in an extended C++ dialect known as Offload C++, which targets the PS3. A significant feature of this language is its use of multiple, statically typed, address spaces. By focusing on generic, polymorphic interfaces for both the generated and hand constructed code, a number of interesting design patterns relating to the memory locality are introduced. A suite of medium-sized (100-700 lines), real-world benchmark programs are used to evaluate the performance, correctness, and scalability of the compiler technology. Absolute speedup values, well in excess of one, are observed for all of the programs. The work ultimately demonstrates that an array language can significantly reduce the effort expended to utilise a parallel heterogeneous multicore architecture, while retaining high performance. A substantial, related advantage in using standard ‘F’ is that any Fortran compiler can create debuggable, and competitively performing serial programs
    corecore