112 research outputs found

    Mixed-data-model heterogeneous compilation and OpenMP offloading

    Get PDF
    Heterogeneous computers combine a general-purpose host processor with domain-specific programmable many-core accelerators, uniting high versatility with high performance and energy efficiency. While the host manages ever more application memory, accelerators are designed to work mainly on their local memory. This difference in addressed memory leads to a discrepancy between the optimal address width of the host and that of the accelerator. Today, 64-bit host processors are commonplace, but few accelerators exceed 32 bits of addressable local memory, a difference expected to increase with 128-bit hosts in the exascale era. Managing this discrepancy requires support for multiple data models in heterogeneous compilers. So far, compiler support for multiple data models has not been explored, which hampers the programmability of such systems and inhibits their adoption. In this work, we perform the first exploration of the feasibility and performance of implementing a mixed-data-model heterogeneous system. To support this, we present and evaluate the first mixed-data-model compiler, supporting arbitrary address widths on host and accelerator. To hide the inherent complexity and to enable high programmer productivity, we implement transparent offloading on top of OpenMP. The proposed compiler techniques are implemented in LLVM and evaluated on a 64+32-bit heterogeneous SoC. Results on benchmarks from the PolyBench-ACC suite show that memory can be transparently shared between host and accelerator at overheads below 0.7 % compared to 32-bit-only execution, enabling mixed-data-model computers to execute at near-native performance.
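    As a concrete illustration of the transparent offloading style described above, the sketch below offloads a vector addition with standard OpenMP target directives in C; the kernel, array names, and sizes are illustrative assumptions and are not taken from the thesis or its PolyBench-ACC benchmarks.

        #include <stdio.h>

        #define N 1024

        /* The host (e.g. 64-bit pointers) shares the arrays with the accelerator,
         * whose local memory may be narrower; the map clauses let the compiler and
         * runtime copy data and translate addresses between the two data models. */
        void vec_add(const double *a, const double *b, double *c, int n)
        {
            #pragma omp target teams distribute parallel for map(to: a[0:n], b[0:n]) map(from: c[0:n])
            for (int i = 0; i < n; ++i)
                c[i] = a[i] + b[i];
        }

        int main(void)
        {
            static double a[N], b[N], c[N];
            for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0 * i; }
            vec_add(a, b, c, N);
            printf("c[10] = %f\n", c[10]);
            return 0;
        }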

    MetaFork: A Compilation Framework for Concurrency Models Targeting Hardware Accelerators

    Get PDF
    Parallel programming is gaining ground in various domains due to the tremendous computational power that it brings; however, it also requires a substantial code-crafting effort to achieve performance improvement. Unfortunately, in most cases, performance tuning has to be accomplished manually by programmers. We argue that automated tuning is necessary due to the combination of the following factors. First, code optimization is machine-dependent; that is, an optimization preferred on one machine may not be suitable for another machine. Second, as the possible optimization search space increases, manually finding an optimized configuration is hard. Therefore, developing new compiler techniques for optimizing applications is of considerable interest. This thesis aims at developing new techniques that will help programmers develop efficient algorithms and code targeting hardware acceleration technologies in a more effective manner. Our work is organized around a compilation framework, called MetaFork, for concurrency platforms and its application to automatic parallelization. MetaFork is a high-level programming language extending C/C++, which combines several models of concurrency including fork-join, SIMD and pipelining parallelism. MetaFork is also a compilation framework which aims at facilitating the design and implementation of concurrent programs through four key features which make MetaFork unique and novel: (1) Perform automatic code translation between concurrency platforms targeting multi-core architectures. (2) Provide a high-level language for expressing concurrency as in the fork-join model, the SIMD paradigm and pipelining parallelism. (3) Generate parallel code from serial code with an emphasis on code depending on machine or program parameters (e.g. cache size, number of processors, number of threads per thread block). (4) Optimize code depending on parameters that are unknown at compile time.
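    MetaFork's own surface syntax is not reproduced here; as a stand-in, the sketch below expresses the fork-join pattern that MetaFork targets using standard OpenMP tasks in C. The recursive Fibonacci example is an illustrative assumption, not a program from the thesis.

        #include <stdio.h>

        /* Fork-join parallelism (plain OpenMP tasks, not MetaFork keywords):
         * the two recursive calls are forked as tasks and joined with taskwait. */
        long fib(int n)
        {
            long x, y;
            if (n < 2) return n;
            #pragma omp task shared(x)
            x = fib(n - 1);
            #pragma omp task shared(y)
            y = fib(n - 2);
            #pragma omp taskwait
            return x + y;
        }

        int main(void)
        {
            long r = 0;
            #pragma omp parallel
            #pragma omp single
            r = fib(20);
            printf("fib(20) = %ld\n", r);
            return 0;
        }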

    Directive-based Approach to Heterogeneous Computing

    Get PDF
    The world of high-performance computing is undergoing major changes that notably increase its complexity. The inability of single-processor, and even multi-processor, systems to sustain the growth in computing power needed to meet the demands of the scientific community has forced the emergence of massively parallel hardware architectures and of units specialized for particular operations. GPUs (Graphics Processing Units) are a good example of this kind of device. These devices, traditionally dedicated to graphics programming, have recently become an ideal platform for implementing massively parallel computations. Combining GPUs for compute-intensive tasks with multi-processors for less intensive tasks with more complex control logic has, in recent years, become one of the most common platforms for low-cost scientific computing, since the power deployed can in many cases match that of small or medium-sized clusters at a markedly lower initial and maintenance cost. Incorporating GPUs into clusters has also increased their capacity. However, the complexity of GPU programming, and of its integration with existing codes, greatly hampers the adoption of these technologies by less expert users. In this thesis we explore the use of directive-based programming models for this kind of environment (multi-core, many-core, GPUs and clusters), where the average user's productivity drops notably because of the difficulty of programming them. To explore the best way to apply directives in these environments, we have developed a set of highly flexible software tools (a compiler and a runtime) that allow diverse techniques to be explored with relatively little effort. The arrival of the OpenACC directive-based programming standard allowed us to demonstrate the capability of these tools by producing an experimental implementation of the standard (accULL) in very little time and with far-from-negligible performance. The computational results reported demonstrate: (a) the reduction in programming effort enabled by directive-based approaches, (b) the capability and flexibility of the tools designed during this thesis for exploring these approaches, and finally (c) the potential for future development of accULL as an experimental OpenACC tool, based on its current performance compared with that of other commercial approaches.
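    As a concrete illustration of the directive-based style that accULL implements (the OpenACC standard), the sketch below marks a SAXPY loop for accelerator execution and lets the runtime manage data movement; the kernel and array names are illustrative assumptions, not codes evaluated in the thesis.

        #include <stdio.h>

        #define N 2048

        /* OpenACC: one directive turns the loop into an accelerator kernel;
         * copyin/copy clauses describe the data movement the runtime performs. */
        void saxpy(float alpha, const float *x, float *y, int n)
        {
            #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
            for (int i = 0; i < n; ++i)
                y[i] = alpha * x[i] + y[i];
        }

        int main(void)
        {
            static float x[N], y[N];
            for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
            saxpy(3.0f, x, y, N);
            printf("y[0] = %f\n", y[0]);
            return 0;
        }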

    Low-power System-on-Chip Processors for Energy Efficient High Performance Computing: The Texas Instruments Keystone II

    No full text
    The High Performance Computing (HPC) community recognizes energy consumption as a major problem. Extensive research is underway to identify means to increase the energy efficiency of HPC systems, including consideration of alternative building blocks for future systems. This thesis considers one such system, the Texas Instruments Keystone II, a heterogeneous Low-Power System-on-Chip (LPSoC) processor that combines a quad-core ARM CPU with an octa-core Digital Signal Processor (DSP). It was first released in 2012. Four issues are considered: i) maximizing the Keystone II ARM CPU performance; ii) implementation and extension of the OpenMP programming model for the Keystone II; iii) simultaneous use of ARM and DSP cores across multiple Keystone SoCs; and iv) an energy model for applications running on LPSoCs like the Keystone II and heterogeneous systems in general. Maximizing the performance of the ARM CPU on the Keystone II system is fundamental to adoption of this system by the HPC community and of the ARM architecture more broadly. Key to achieving good performance is exploitation of the ARM vector instructions. This thesis presents the first detailed comparison of the use of ARM compiler intrinsic functions with automatic compiler vectorization across four generations of ARM processors. Comparisons are also made with x86-based platforms and the use of equivalent Intel vector instructions. Implementation of the OpenMP programming model on the Keystone II system presents both challenges and opportunities. Challenges in that the OpenMP model was originally developed for a homogeneous programming environment with a common instruction set architecture, and in 2012 work had only just begun to consider how OpenMP might work with accelerators. Opportunities in that shared memory is accessible to all processing elements on the LPSoC, offering performance advantages over what typically exists with attached accelerators. This thesis presents an analysis of a prototype version of OpenMP implemented as a bare-metal runtime on the DSP of a Keystone I system. An implementation for the Keystone II that maps OpenMP 4.0 accelerator directives to OpenCL runtime library operations is presented and evaluated. Exploitation of some of the underlying hardware features of the Keystone II is also discussed. Simultaneous use of the ARM and DSP cores across multiple Keystone II boards is fundamental to the creation of commercially viable HPC offerings based on Keystone technology. The nCore BrownDwarf and HPE Moonshot systems represent two such systems. This thesis presents a proof-of-concept implementation of matrix multiplication (GEMM) for the BrownDwarf system. The BrownDwarf utilizes both Keystone II and Keystone I SoCs through a point-to-point interconnect called Hyperlink. Details of how a novel message passing communication framework across Hyperlink was implemented to support this complex environment are provided. An energy model that can be used to predict energy usage as a function of what fraction of a particular computation is performed on each of the available compute devices offers the opportunity for making runtime decisions on how best to minimize energy usage. This thesis presents a basic energy usage model that considers the rates of execution on each device and their active and idle power usages. Using this model, it is shown that only under certain conditions does there exist an energy-optimal work partition that uses multiple compute devices.
To validate the model, a high-resolution energy measurement environment is developed and used to gather energy measurements for a matrix multiplication benchmark running on a variety of systems. Results presented support the model. Drawing on the four issues noted above and other developments that have occurred since the Keystone II system was first announced, the thesis concludes by making comments regarding the future of LPSoCs as building blocks for HPC systems.
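    To make the partitioning idea concrete, the sketch below gives a minimal two-device energy model in the spirit described above: a fraction f of the work runs on one device and the remainder on the other, each device drawing active power while busy and idle power while waiting. The functional form, parameter names, and numbers are illustrative assumptions; the thesis model may differ in detail.

        #include <stdio.h>

        /* Energy for a workload W split so that fraction f runs on device A.
         * rA, rB   : execution rates (work units per second)
         * pAa, pBa : active power (W); pAi, pBi : idle power (W)
         * Devices run concurrently; each idles once its share is finished. */
        double energy(double W, double f,
                      double rA, double pAa, double pAi,
                      double rB, double pBa, double pBi)
        {
            double tA = f * W / rA;           /* time device A is busy */
            double tB = (1.0 - f) * W / rB;   /* time device B is busy */
            double T  = tA > tB ? tA : tB;    /* total elapsed time    */
            return pAa * tA + pAi * (T - tA)
                 + pBa * tB + pBi * (T - tB);
        }

        int main(void)
        {
            /* Sweep the partition fraction to look for an energy-optimal split. */
            for (double f = 0.0; f <= 1.0; f += 0.25)
                printf("f = %.2f  E = %.1f J\n",
                       f, energy(1000.0, f, 50.0, 10.0, 2.0, 25.0, 4.0, 1.0));
            return 0;
        }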

    Parallel programming models and tools for many-core platforms

    Get PDF
    The negotiation between power consumption, performance, programmability, and portability drives all computing industry designs, in particular in the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harnessing the computational power of heterogeneous many-core SoCs. This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and simplify programmability for heterogeneous many-core platforms. The thesis contributions cover vertically the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare metal to programming models, and from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms. First, we present mechanisms to decrease parallelism overheads in parallel programming runtimes for many-core platforms, targeting fine-grain parallelism. Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems. Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently. All these contributions were validated using STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural alternatives were explored using VirtualSoC, a SystemC-based cycle-accurate simulator of many-core platforms.

    XcalableMP PGAS Programming Language

    Get PDF
    XcalableMP is a directive-based parallel programming language based on Fortran and C, supporting a Partitioned Global Address Space (PGAS) model for distributed-memory parallel systems. This open access book presents the XcalableMP language, from its programming model and basic concepts to the experience and performance of applications written in XcalableMP. XcalableMP was adopted as a parallel programming language project within the FLAGSHIP 2020 project, which developed the Japanese flagship supercomputer Fugaku, to improve the productivity of parallel programming. XcalableMP is now available on Fugaku, and its performance is enhanced by the Fugaku interconnect, Tofu-D. The global-view programming model of XcalableMP, inherited from High Performance Fortran (HPF), provides an easy and useful way to parallelize data-parallel programs with directives for distributed global arrays, work distribution, and shadow communication. The local-view programming model adopts coarray notation from Coarray Fortran (CAF) to describe explicit communication in a PGAS model. The language specification was designed and proposed by the XcalableMP Specification Working Group organized in the PC Consortium, Japan. The Omni XcalableMP compiler is a production-level reference implementation of the XcalableMP compiler for C and Fortran 2008, developed by RIKEN CCS and the University of Tsukuba. XcalableMP has been used on Fugaku as well as on the K computer. A performance study showed that XcalableMP achieves scalable performance comparable to message passing interface (MPI) versions, with a clean and easy-to-understand programming style that requires little effort.
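    The sketch below hints at the global-view style in XcalableMP's C binding: a global array is block-distributed over the node set and the loop is partitioned to match. The directive spellings follow the XcalableMP specification as best recalled and should be treated as assumptions to be checked against the Omni XcalableMP documentation; the array and computation are illustrative only.

        #define N 1024

        /* Node set, template, and block distribution of the template over the nodes. */
        #pragma xmp nodes p(*)
        #pragma xmp template t(0:N-1)
        #pragma xmp distribute t(block) onto p

        double a[N];
        /* Align the global array with the template so each node owns one block of it. */
        #pragma xmp align a[i] with t(i)

        int main(void)
        {
            /* Work distribution: each node executes only the iterations it owns. */
            #pragma xmp loop on t(i)
            for (int i = 0; i < N; ++i)
                a[i] = 2.0 * i;
            return 0;
        }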

    Run-time support for multi-level disjoint memory address spaces

    Get PDF
    High Performance Computing (HPC) systems have become widely used tools in many industry areas and research fields. Research to produce more powerful and efficient systems has grown in step with their popularity. As a consequence, the complexity of modern HPC architectures has increased in order to provide systems with the highest levels of performance. This increased complexity has also affected the way HPC systems are programmed. HPC users have to deal with new devices, languages and tools, and this can be a significant barrier to entry for people who do not have a deep background in computer science. Alongside the evolution of HPC systems, programming models have also evolved to ease the task of developing applications for these machines. Two well-known examples are OpenMP and MPI. The former can be used in shared-memory systems and is praised for offering an easy software development methodology. The latter is more popular because it targets distributed environments, but it is considered burdensome to use. Besides these two, many programming models have emerged to propose new methodologies or to handle new hardware devices. One of these models is OmpSs. OmpSs is a programming model for modern HPC systems that is based on OpenMP and StarSs. Developed by the Programming Models group at the Barcelona Supercomputing Center, it targets the latest generation of HPC systems while benefiting from the ease of use of OpenMP. OmpSs offers asynchronous parallelism through the concept of tasks with data dependencies. These tasks allow the specification of sections of code that can be executed in parallel, while the dependencies specify restrictions on the order in which the tasks can be executed. With this, OmpSs programs can adapt to many different system configurations while fundamentally still being sequential programs with annotations. This thesis explores the benefits of providing OmpSs with the capability to target architectures with complex memory hierarchies. An example of such a system is the new generation of clusters that use accelerators to boost their computing capabilities. The memory hierarchy of these machines is composed of a first level of distributed memory formed by the memory of each individual node, and a second level formed by the private memory of each accelerator device. Our first contribution presents the implementation of support for clusters of multi-cores in the OmpSs programming model. We also present two optimizations to boost the performance of applications running on top of cluster systems: a specific task scheduling policy and the addition of slave-to-slave transfers. We evaluate our implementation using a set of benchmarks coded in OmpSs and we also compare them against the same applications implemented using MPI, the most widely used programming model for these systems. We extend our initial implementation in our second contribution, which provides OmpSs with support for clusters of GPUs. We show that OmpSs programs targeting these complex systems are capable of achieving good performance when compared against MPI+CUDA implementations. The third contribution of this thesis presents an implementation and evaluation of the performance and programmability impact of supporting non-contiguous memory regions. Offering this feature allows applications with complex data accesses to be easily annotated with OmpSs.
This is important to widen the spectrum of applications that can be handled by the programming model.
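    As a brief illustration of the task-with-dependencies style that OmpSs offers, the sketch below uses OmpSs-1 style in/out clauses on tasks, intended for the OmpSs toolchain (Mercurium/Nanos++); the in/out clauses are OmpSs syntax rather than standard OpenMP, and the array names and kernel are illustrative assumptions, not benchmarks from the thesis.

        #include <stdio.h>

        #define N 1024

        void init(double *v, int n, double val)
        {
            for (int i = 0; i < n; ++i) v[i] = val;
        }

        int main(void)
        {
            static double a[N], b[N], c[N];

            /* The two init tasks have no mutual dependency and may run in parallel. */
            #pragma omp task out(a)
            init(a, N, 1.0);

            #pragma omp task out(b)
            init(b, N, 2.0);

            /* This task waits for both producers because it reads a and b. */
            #pragma omp task in(a, b) out(c)
            for (int i = 0; i < N; ++i) c[i] = a[i] + b[i];

            #pragma omp taskwait
            printf("c[0] = %f\n", c[0]);
            return 0;
        }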

    Optimisation of computational fluid dynamics applications on multicore and manycore architectures

    Get PDF
    This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact are demonstrated in a block-structured and an unstructured code, both of a size representative of industrial applications, and across a variety of processor architectures that make up contemporary high-performance computing systems. The importance of vectorization, and the ways through which it can be achieved, is demonstrated in both structured and unstructured solvers, together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, exploiting this memory, whether explicitly or implicitly, is shown to be imperative for best performance. The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and between 5.7X and 24X on the manycore processors.
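    As one concrete example of the parameterised optimisations mentioned above, the sketch below strip-mines/tiles a simple 2-D smoothing loop nest; TILE is the kind of parameter an auto-tuner would select per architecture. The kernel, names, and tile size are illustrative assumptions and are not taken from the thesis codes.

        #include <stddef.h>

        #define TILE 64   /* tile size: a typical auto-tuning parameter */

        /* Loop tiling: iterate over tiles (jj, ii), then over the points inside each
         * tile, to improve cache reuse of the neighbouring rows of `in`. */
        void smooth(double *restrict out, const double *restrict in, size_t n)
        {
            for (size_t jj = 1; jj + 1 < n; jj += TILE) {
                size_t jmax = (jj + TILE < n - 1) ? jj + TILE : n - 1;
                for (size_t ii = 1; ii + 1 < n; ii += TILE) {
                    size_t imax = (ii + TILE < n - 1) ? ii + TILE : n - 1;
                    for (size_t j = jj; j < jmax; ++j)
                        for (size_t i = ii; i < imax; ++i)
                            out[j * n + i] = 0.25 * (in[j * n + i - 1] + in[j * n + i + 1]
                                                   + in[(j - 1) * n + i] + in[(j + 1) * n + i]);
                }
            }
        }

        int main(void)
        {
            enum { N = 256 };
            static double src[N * N], dst[N * N];
            for (size_t k = 0; k < (size_t)N * N; ++k) src[k] = (double)k;
            smooth(dst, src, N);
            return 0;
        }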

    Scientific Programming and Computer Architecture

    Get PDF
    A variety of programming models relevant to scientists explained, with an emphasis on how programming constructs map to parts of the computer. What makes computer programs fast or slow? To answer this question, we have to get behind the abstractions of programming languages and look at how a computer really works. This book examines and explains a variety of scientific programming models (programming models relevant to scientists) with an emphasis on how programming constructs map to different parts of the computer's architecture. Two themes emerge: program speed and program modularity. Throughout this book, the premise is to "get under the hood," and the discussion is tied to specific programs. The book digs into linkers, compilers, operating systems, and computer architecture to understand how the different parts of the computer interact with programs. It begins with a review of C/C++ and explanations of how libraries, linkers, and Makefiles work. Programming models covered include Pthreads, OpenMP, MPI, TCP/IP, and CUDA. The emphasis on how computers work leads the reader into computer architecture and occasionally into the operating system kernel. The operating system studied is Linux, the preferred platform for scientific computing. Linux is also open source, which allows users to peer into its inner workings. A brief appendix provides a useful table of machines used to time programs. The book's website (https://github.com/divakarvi/bk-spca) has all the programs described in the book as well as a link to the HTML text.