7 research outputs found

    XcalableMP PGAS Programming Language

    Get PDF
    XcalableMP is a directive-based parallel programming language based on Fortran and C, supporting a Partitioned Global Address Space (PGAS) model for distributed memory parallel systems. This open access book presents XcalableMP language from its programming model and basic concept to the experience and performance of applications described in XcalableMP.  XcalableMP was taken as a parallel programming language project in the FLAGSHIP 2020 project, which was to develop the Japanese flagship supercomputer, Fugaku, for improving the productivity of parallel programing. XcalableMP is now available on Fugaku and its performance is enhanced by the Fugaku interconnect, Tofu-D. The global-view programming model of XcalableMP, inherited from High-Performance Fortran (HPF), provides an easy and useful solution to parallelize data-parallel programs with directives for distributed global array and work distribution and shadow communication. The local-view programming adopts coarray notation from Coarray Fortran (CAF) to describe explicit communication in a PGAS model. The language specification was designed and proposed by the XcalableMP Specification Working Group organized in the PC Consortium, Japan. The Omni XcalableMP compiler is a production-level reference implementation of XcalableMP compiler for C and Fortran 2008, developed by RIKEN CCS and the University of Tsukuba. The performance of the XcalableMP program was used in the Fugaku as well as the K computer. A performance study showed that XcalableMP enables a scalable performance comparable to the message passing interface (MPI) version with a clean and easy-to-understand programming style requiring little effort

    Run-time support for multi-level disjoint memory address spaces

    Get PDF
    High Performance Computing (HPC) systems have become widely used tools in many industry areas and research fields. Research to produce more powerful and efficient systems has grown in par with their popularity. As a consequence, the complexity of modern HPC architectures has increased in order to provide systems with the highest levels of performance. This increased complexity has also affected the way HPC systems are programmed. HPC users have to deal with new devices, languages and tools, and this is can be a significant access barrier to people that do not have a deep knowledge in computer science. On par with the evolution of HPC systems, programming models have also evolved to ease the task of developing applications for these machines. Two well-known examples are OpenMP and MPI. The former can be used in shared memory systems and is praised for offering an easy methodology of software development. The latter is more popular because it targets distributed environments but it is considered burdensome to use. Besides these two, many programming models have emerged to propose new methodologies or to handle new hardware devices. One of these models is OmpSs. OmpSs is a programming model for modern HPC systems that is based on OpenMP and StarSs. Developed by the Programming Models group at the Barcelona Supercomputing Center, it targets the latest generation of HPC systems while benefiting from the ease of use of OpenMP. OmpSs offers asynchronous parallelism with the concept of tasks with data dependencies. These tasks allow the specification of sections of code that can be executed in parallel while the dependencies specify the restrictions about the order in which the tasks can be executed. With this, OmpSs programs can adapt to a many different system configurations while fundamentally still being sequential programs with annotations. This thesis explores the benefits of providing OmpSs the capability to target architectures with complex memory hierarchies. An example of such systems can be the new generation of clusters that use accelerators to power their computing capabilities. The memory hierarchy of these machines is composed of a first level of distributed memory formed by the memory of each individual node, and a second level formed by the private memory of each accelerator devices. Our first contribution shows the implementation of the support of cluster of multi-cores for the OmpSs programming model. We also present two optimizations to boost the performance of applications running on top of cluster systems: a specific task scheduling policy and the addition of slave-to-slave transfers. We evaluate our implementation using a set of benchmarks coded in OmpSs and we also compare them against the same applications implemented using MPI, the most widely used programming model for these systems. We extend our initial implementation in our second contribution, which provides OmpSs with support for clusters of GPUs. We show that OmpSs programs targeting these complex systems are capable of achieving a good performance when compared against MPI+CUDA implementations. The third contribution of this thesis presents an implementation and evaluation of the performance and programmability impact of supporting non-contiguous memory regions. Offering this feature allows applications with complex data accesses to be easily annotated with OmpSs. This is important to widen the spectrum of applications that can be handled by the programming model.Els sistemes de computació d'altes prestacions (CAP) han esdevingut eines importants en diferents sectors industrials i camps de recerca. La recerca per produir sistemes més potents i eficients ha crescut proporcionalment a aquesta popularitat. Com a conseqüència, la complexitat d'aquest tipus de sistemes s'ha incrementat per tal de dotar-los d'altes prestacions. Aquest increment en la complexitat també ha afectat la manera de programar aquest tipus de sistemes. Els usuaris de sistemes CAP han de treballar amb nous dispositius, llenguatges i eines, i això pot convertir-se en una barrera d'entrada significativa per aquelles persones que no tinguin uns alts coneixements informàtics. Seguin l'evolució dels sistemes CAP, els models de programació també han evolucionat per tal de facilitar la tasca de desenvolupar aplicacions per aquests sistemes. Dos exemples ben coneguts son OpenMP i MPI. El primer es pot utilitzar en sistemes de memòria compartida i es reconegut per oferir una metodologia de desenvolupament senzilla. El segon és més popular perquè està dissenyat per sistemes distribuïts, però està considerat difícil d'utilitzar. A part d'aquests dos, altres models de programació han sorgit per proposar noves metodologies o per suportar nous components hardware. Un d'aquests nous models és OmpSs. OmpSs és un model de programació per sistemes CAP moderns que està basat en OpenMP i StarSs. Desenvolupat pel grup de Models de Programació del Barcelona Supercomputing Center, està dissenyat per suportar la darrera generació de sistemes CAP i alhora oferir la facilitat d'us d'OpenMP. OmpSs ofereix paral·lelisme asíncron mitjançant el concepte de tasques amb dependències de dades. Aquestes tasques permeten especificar regions de codi que poden ser executades en paral·lel, mentre que les dependències especifiquen les restriccions sobre l'ordre en que aquestes tasques poden ser executades. Amb això, els programes fets amb OmpSs poden adaptar-se a sistemes amb diferents configuracions tot i ser fonamentalment programes seqüencials amb anotacions. Aquesta tesi explora els beneficis de proveir a OmpSs amb la capacitat de funcionar sobre arquitectures amb jerarquies de memòria complexes. Un exemple d'un sistema així pot ser un dels clústers de nova generació que utilitzen acceleradors per tal d'oferir més capacitat de càlcul. La jerarquia de memòria en aquestes màquines està composada per un primer nivell de memòria distribuïda formada per la memòria de cada node individual, i el segon nivell està format per la memòria privada de cada accelerador. La primera contribució d'aquesta tesi mostra la implementació del suport de clústers de multi-cores pel model de programació OmpSs. També presentem dos optimitzacions per millorar el rendiment de les aplicacions quan s'executen en sistemes clúster: una política de planificació de tasques específica i la incorporació dels missatges entre nodes esclaus. Avaluem la nostra implementació usant un conjunt d'aplicacions programades en OmpSs i també les comparem amb les mateixes aplicacions implementades usant MPI, el model de programació més estès per aquest tipus de sistemes. En la segona contribució estenem la nostra implementació inicial per tal de dotar OmpSs de suport per clústers de GPUs. Mostrem que els programes OmpSs son capaços d'obtenir un bon rendiment sobre aquests tipus de sistemes, fins i tot quan els comparem amb versions implementades usant MPI+CUDA. La tercera contribució descriu la implementació i avaluació del rendiment i de l'impacte de suportar regions de memòria no contigües. Oferir aquesta funcionalitat permet implementar fàcilment amb OmpSs aplicacions amb accessos complexes a memòria, cosa que és important de cara a ampliar l'espectre d'aplicacions que poden ser tractades pel model de programació

    演算加速機構を持つ並列システム向けPGAS言語コンパイラの研究

    Get PDF
    筑波大学 (University of Tsukuba)201

    Effizientes Programmiermodell für OpenMP auf einem Cluster-basierten Many-Core-System

    Get PDF
    Da die Komplexität „System-on-Chip“ (SoC) auch weiterhin zunimmt, wird man die Herausforderungen aufgrund der Konvergenz der Software- und Hardwareentwicklung nicht ignorieren können. Dies gilt auch für den Umgang mit dem hierarchischen Design, in dem die Prozessorkerne in Clustern oder sogenannten „Tiles“ angeordnet werden, um mittels eines schnellen lokalen Speicherzugriffs eine geringe Latenz und eine hohe Bandbreite der lokalen Kommunikation zu gewährleisten. Aus der Sicht eines Programmierers ist es wünschenswert, sich diese Eigenheiten der Hardware zunutze zu machen und sie bei der Ausgestaltung der abstrakten Parallel-Programmierung gewissenhaft und zielführend zu berücksichtigen. Diese Dissertation überwindet viele Engpässe in Bezug auf die Skalierbarkeit Cluster-basierter Many-Core-Systeme und führt das Programmiermodell OpenMP zur Vereinfachung der Anwendungsentwicklung ein. OpenMP abstrahiert von der Sichtweise des Programmierers – und es werden Richtlinien eingeführt, mit denen Schleifen in Programmsequenzen eingeteilt werden, als Basis für die parallele Programmierung. In dieser Arbeit wird das OpenMP-Modell bespielhaft in einem konkreten Cluster-basierten Many-Core-System umgesetzt; dem Intel Single-Chip Cloud Computer (SCC). Es wird eine schlanke und hoch-optimierte Laufzeitschicht für die Ausführung von OpenMP sowie ein Speichermodell vorgestellt. Auf Basis dieser Laufzeitschicht wird der parallele Code automatisch von einem nativen Backend-Compiler (GCC 4.6) erzeugt, der mit der Laufzeitbibliothek verknüpft ist. Im Rahmen der Arbeit wird auf einen effizienten Designansatz für die OpenMP-Programmierung eingegangen, wobei der Intel SCC als Beispiel für Cluster-basierte Systeme zum Einsatz kommt. In nicht-Cache-kohärenten Systemen dient die SCC OpenMP Laufzeitbibliothek primär dazu, die folgenden Herausforderungen zu bewältigen: 1. Die Ausführung von unmodifizierten, bestehenden OpenMP Programmen auf solchen Systemen. 2. Die Portierung des OpenMP-Speichermodells auf den SCC. 3. Die Synchronisation der parallelen Threads, auf die ein beträchtlicher Anteil der Ausführungszeit einer Anwendung entfällt. Eine Reihe weiterer Beispiele, basierend auf verschiedenen gebräuchlichen Kernen und realen Anwendungen, untermauert die Tauglichkeit von OpenMP – und eine Reihe von Experimenten zeigt, wie dieses Modell zu einer deutlichen Beschleunigung (bis zu 48-fach) in verschiedenen parallelen Anwendungen führt.As the complexity of systems-on-chip (SoCs) continues to increase, it is no longer possible to ignore the challenges caused by the convergence of software and hardware development. This involves attempts to deal with the hierarchical design – in which several cores are grouped in clusters or tiles – to ensure low-latency, high-bandwidth local communication by relying on fast local memories. From a programmer’s perspec- tive, it is desirable to make use of these peculiarities of the hardware, which must be clearly and carefully taken into account when designing the support for high-level parallel programming models. This dissertation overcomes many scalability bottlenecks in cluster-based many-core systems and introduces the OpenMP programming model as a means of simplifying application development. OpenMP represents an abstraction of the programmer’s view by providing abundant directives that decompose loops in sequential programs and lead to parallel programs. In this work, the full OpenMP model is implemented on a specific instance of a cluster-based many-core system: the Intel Single-chip Cloud Computer (SCC). In this thesis, a lightweight and highly optimized runtime layer for OpenMP execution and memory model by generating the parallel code that is automatically compiled by native back-end compiler (GCC 4.6) that linked with the runtime library. In this dissertation, I will address an efficient design approach of the OpenMP pro- gramming model for the Intel SCC as an example for cluster-based systems. The SCC OpenMP runtime library is designed to cope with three main challenges in a non-cache coherent system: 1. Executing unmodified legacy OpenMP programs on such system. 2. Landing OpenMP memory model on the SCC. 3. Synchronization in the work of parallel threads accounts for a sizeable fraction of an application’s execution time. Furthermore, the effectiveness of OpenMP is demonstrated on a set of widely used kernels and real-world applications. An extensive set of experiments shows how this model achieves significant parallel speedups up to 48x in several applications

    Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos

    Get PDF
    RESUMEN Los sistemas heterogéneos son cada vez más relevantes, debido a sus capacidades de rendimiento y eficiencia energética, estando presentes en todo tipo de plataformas de cómputo, desde dispositivos embebidos y servidores, hasta nodos HPC de grandes centros de datos. Su complejidad hace que sean habitualmente usados bajo el paradigma de tareas y el modelo de programación host-device. Esto penaliza fuertemente el aprovechamiento de los aceleradores y el consumo energético del sistema, además de dificultar la adaptación de las aplicaciones. La co-ejecución permite que todos los dispositivos cooperen para computar el mismo problema, consumiendo menos tiempo y energía. No obstante, los programadores deben encargarse de toda la gestión de los dispositivos, la distribución de la carga y la portabilidad del código entre sistemas, complicando notablemente su programación. Esta tesis ofrece contribuciones para mejorar el rendimiento y la eficiencia energética en estos sistemas masivamente paralelos. Se realizan propuestas que abordan objetivos generalmente contrapuestos: se mejora la usabilidad y la programabilidad, a la vez que se garantiza una mayor abstracción y extensibilidad del sistema, y al mismo tiempo se aumenta el rendimiento, la escalabilidad y la eficiencia energética. Para ello, se proponen dos motores de ejecución con enfoques completamente distintos. EngineCL, centrado en OpenCL y con una API de alto nivel, favorece la máxima compatibilidad entre todo tipo de dispositivos y proporciona un sistema modular extensible. Su versatilidad permite adaptarlo a entornos para los que no fue concebido, como aplicaciones con ejecuciones restringidas por tiempo o simuladores HPC de dinámica molecular, como el utilizado en un centro de investigación internacional. Considerando las tendencias industriales y enfatizando la aplicabilidad profesional, CoexecutorRuntime proporciona un sistema flexible centrado en C++/SYCL que dota de soporte a la co-ejecución a la tecnología oneAPI. Este runtime acerca a los programadores al dominio del problema, posibilitando la explotación de estrategias dinámicas adaptativas que mejoran la eficiencia en todo tipo de aplicaciones.ABSTRACT Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications. Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant), the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union’s Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The Integration II: Hybrid programming models of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS)

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest

    Design and implementation of an array language for computational science on a heterogeneous multicore architecture

    Get PDF
    The packing of multiple processor cores onto a single chip has become a mainstream solution to fundamental physical issues relating to the microscopic scales employed in the manufacture of semiconductor components. Multicore architectures provide lower clock speeds per core, while aggregate floating-point capability continues to increase. Heterogeneous multicore chips, such as the Cell Broadband Engine (CBE) and modern graphics chips, also address the related issue of an increasing mismatch between high processor speeds, and huge latency to main memory. Such chips tackle this memory wall by the provision of addressable caches; increased bandwidth to main memory; and fast thread context switching. An associated cost is often reduced functionality of the individual accelerator cores; and the increased complexity involved in their programming. This dissertation investigates the application of a programming language supporting the first-class use of arrays; and capable of automatically parallelising array expressions; to the heterogeneous multicore domain of the CBE, as found in the Sony PlayStation 3 (PS3). The language is a pre-existing and well-documented proper subset of Fortran, known as the ‘F’ programming language. A bespoke compiler, referred to as E , is developed to support this aim, and written in the Haskell programming language. The output of the compiler is in an extended C++ dialect known as Offload C++, which targets the PS3. A significant feature of this language is its use of multiple, statically typed, address spaces. By focusing on generic, polymorphic interfaces for both the generated and hand constructed code, a number of interesting design patterns relating to the memory locality are introduced. A suite of medium-sized (100-700 lines), real-world benchmark programs are used to evaluate the performance, correctness, and scalability of the compiler technology. Absolute speedup values, well in excess of one, are observed for all of the programs. The work ultimately demonstrates that an array language can significantly reduce the effort expended to utilise a parallel heterogeneous multicore architecture, while retaining high performance. A substantial, related advantage in using standard ‘F’ is that any Fortran compiler can create debuggable, and competitively performing serial programs
    corecore