12 research outputs found

    Productive Programming Systems for Heterogeneous Supercomputers

    The majority of today's scientific and data analytics workloads are still run on relatively energy-inefficient, heavyweight, general-purpose processing cores, often referred to in the literature as latency-oriented architectures. The flexibility of these architectures and the programmer aids they include (e.g. large and deep cache hierarchies, branch-prediction logic, prefetch logic) make them able to run a wide range of applications quickly. However, we have started to see growth in the use of lightweight, simpler, energy-efficient, and functionally constrained cores. These architectures are commonly referred to as throughput-oriented. Within each shared-memory node, the computational backbone of future throughput-oriented HPC machines will consist of large pools of lightweight cores. The first wave of throughput-oriented computing came in the mid-2000s with the use of GPUs for general-purpose and scientific computing. Today we are entering the second wave of throughput-oriented computing, with the introduction of NVIDIA Pascal GPUs, Intel Knights Landing Xeon Phi processors, the Epiphany Co-Processor, the Sunway MPP, and other throughput-oriented architectures that enable pre-exascale computing. However, while the majority of the FLOPS in designs for future HPC systems come from throughput-oriented architectures, they are still commonly paired with latency-oriented cores, which handle management functions and lightweight or un-parallelizable computational kernels. Hence, most future HPC machines will be heterogeneous in their processing cores. The heterogeneity of future machines will not be limited to the processing elements, however: heterogeneity will also exist in the storage, networking, memory, and software stacks of future supercomputers. As a result, it will be necessary to combine many different programming models and libraries in a single application. How to do so in a programmable and well-performing manner is an open research question.
This thesis addresses this question using two approaches. First, we explore using managed runtimes on HPC platforms. Thanks to their high-level programming models, these managed runtimes have a long history of supporting data analytics workloads on commodity hardware, but they often come with overheads that make them less common in the HPC domain, and they are not supported natively on throughput-oriented architectures. Second, we explore the use of a modular programming model and work-stealing runtime to compose the programming and scheduling of multiple third-party HPC libraries. This approach leverages existing investment in HPC libraries, unifies the scheduling of work on a platform, and is designed to quickly support new programming-model and runtime extensions. In support of these two approaches, this thesis also makes novel contributions in tooling for future supercomputers. We demonstrate the value of checkpoints as a software development tool on current and future HPC machines, and present novel techniques in performance prediction across heterogeneous cores.
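The second approach above hinges on a work-stealing runtime that unifies scheduling across libraries. As a rough illustration only (not the thesis's actual runtime; all names here are hypothetical), the classic discipline of owners working one end of their deque while thieves steal from the other can be sketched in a few lines:

```python
from collections import deque
import random

class WorkStealingWorker:
    """Each worker owns a deque: it pushes and pops tasks at one end,
    while thieves steal from the opposite end, minimizing contention."""
    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)                          # owner adds newest task

    def pop(self):
        return self.tasks.pop() if self.tasks else None  # owner takes newest (LIFO)

    def steal(self):
        return self.tasks.popleft() if self.tasks else None  # thief takes oldest (FIFO)

def run_all(workers):
    """Drain all queues; a worker with no tasks tries to steal from a
    random victim. Stops when a full pass makes no progress."""
    results = []
    while True:
        progressed = False
        for w in workers:
            task = w.pop()
            if task is None:
                task = random.choice(workers).steal()
            if task is not None:
                results.append(task())
                progressed = True
        if not progressed:
            return results
```

The owner-LIFO/thief-FIFO split is the usual design choice: the owner keeps cache-hot recent tasks, while thieves take older tasks that tend to represent larger chunks of remaining work.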

    Proceedings of the 7th International Conference on PGAS Programming Models


    Data-centric Performance Measurement and Mapping for Highly Parallel Programming Models

    Modern supercomputers have complex features: many hardware threads, deep memory hierarchies, and many co-processors/accelerators. Productively and effectively designing programs to utilize those hardware features is crucial in gaining the best performance. There are several highly parallel programming models in active development that allow programmers to write efficient code on those architectures. Performance profiling is an essential technique for achieving the best performance during development. In this dissertation, I propose a new performance measurement and mapping technique that associates performance data with program variables instead of code blocks. To validate the applicability of my data-centric profiling idea, I designed and implemented a profiler for PGAS and CUDA. For PGAS, I developed ChplBlamer, for both single-node and multi-node Chapel programs. My tool also provides new features such as data-centric inter-node load-imbalance identification. For CUDA, I developed CUDABlamer for GPU-accelerated applications. CUDABlamer also attributes performance data to program variables, a feature not found in any previous CUDA profiler. Guided by the insights from the tools, I optimized several widely-studied benchmarks and significantly improved program performance, by a factor of up to 4x for Chapel and 47x for CUDA kernels.
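The data-centric idea, attributing cost to variables rather than code blocks, can be illustrated with a toy mapping from sampled memory addresses to registered variable address ranges. This is a simplification for illustration only; the names and mechanism are hypothetical, not ChplBlamer's or CUDABlamer's actual implementation:

```python
import bisect

class DataCentricProfiler:
    """Attributes cost samples to program variables by the address range
    each variable occupies (a simplified 'blame' mapping)."""
    def __init__(self):
        self.starts = []   # sorted range start addresses
        self.ranges = []   # parallel list of (start, end, variable name)
        self.cost = {}

    def register(self, name, start, size):
        """Record that `name` occupies [start, start + size)."""
        i = bisect.bisect(self.starts, start)
        self.starts.insert(i, start)
        self.ranges.insert(i, (start, start + size, name))

    def sample(self, addr, cost=1):
        """Charge a cost sample to whichever variable contains `addr`."""
        i = bisect.bisect(self.starts, addr) - 1
        if i >= 0:
            start, end, name = self.ranges[i]
            if start <= addr < end:
                self.cost[name] = self.cost.get(name, 0) + cost

    def top(self):
        """The variable blamed for the most cost."""
        return max(self.cost, key=self.cost.get) if self.cost else None
```

A real profiler would populate the ranges from debug information and allocation tracking, and the samples from hardware event-based sampling; the lookup structure above only shows the attribution step.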

    Adaptive Data Migration in Load-Imbalanced HPC Applications

    Distributed parallel applications need to maximize and maintain computer resource utilization and be portable across different machines. Balanced execution of some applications requires more effort than others because their data distribution changes over time. Data re-distribution at runtime requires elaborate schemes that are expensive and may only benefit particular applications. This dissertation presents a solution for HPX applications that monitors application execution with APEX and uses AGAS migration to adaptively redistribute data and load-balance applications at runtime, improving application performance and scaling behavior. This dissertation provides evidence for the practicality of using the Active Global Address Space as proposed by the ParalleX model and implemented in HPX. It does so by using migration for the transparent moving of objects at runtime and by using the Autonomic Performance Environment for eXascale (APEX) library, with experiments that run on homogeneous and heterogeneous machines at Louisiana State University, the CSCS Swiss National Supercomputing Centre, and the National Energy Research Scientific Computing Center.
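The adaptive redistribution described here boils down to measuring load imbalance and migrating work until it falls below a threshold. A deliberately simplified, hypothetical sketch of such a policy (not HPX, APEX, or AGAS code):

```python
def imbalance_ratio(loads):
    """Load imbalance as max/mean; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean > 0 else 1.0

def plan_migrations(loads, threshold=1.25):
    """Greedily move one unit of load from the most-loaded to the
    least-loaded locality until the imbalance drops below the threshold.
    Returns the migration plan and the resulting load vector."""
    loads = list(loads)
    moves = []
    while imbalance_ratio(loads) > threshold:
        src = loads.index(max(loads))   # most loaded locality
        dst = loads.index(min(loads))   # least loaded locality
        loads[src] -= 1
        loads[dst] += 1
        moves.append((src, dst))
    return moves, loads
```

In the dissertation's setting the "loads" would come from APEX runtime measurements and each move would be an AGAS object migration; the sketch only shows the shape of the decision loop.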

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for the systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. These interfaces are now part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process that can be applied across programming models. Based on the detection of wait or idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause, as well as the critical-path share of each program region. Thus, it determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace enhanced with information about wait states, their causes, and the critical path. In addition, a ranking based on the amount of waiting time a program region caused on the critical path highlights program regions that are relevant for program optimization.
The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), while the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated in order to show the applicability of the analysis.
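The wait-state and critical-path ideas can be illustrated on a toy trace: at a synchronization point, the slowest execution stream defines the critical path, and every other stream accumulates waiting time. A minimal sketch, not the framework's implementation:

```python
def barrier_wait_times(compute_times):
    """At a synchronization point, each stream waits for the slowest one;
    the slowest stream lies on the critical path (zero wait)."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]

def critical_path_share(regions_per_stream):
    """regions_per_stream: {stream: [(region, duration), ...]} between two
    global synchronization points. Returns, per program region, the time
    it contributes to the critical path (the slowest stream's regions)."""
    totals = {s: sum(d for _, d in regs)
              for s, regs in regions_per_stream.items()}
    critical = max(totals, key=totals.get)  # stream with no waiting time
    share = {}
    for region, dur in regions_per_stream[critical]:
        share[region] = share.get(region, 0) + dur
    return share
```

A real trace-based analysis must also follow wait-state propagation across multiple synchronization intervals and parallelism levels; the sketch covers only a single interval.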

    Evaluating technologies and techniques for transitioning hydrodynamics applications to future generations of supercomputers

    Current supercomputer development trends present severe challenges for scientific codebases. Moore’s law continues to hold; however, power constraints have brought an end to Dennard scaling, forcing significant increases in overall concurrency. The performance imbalance between the processor and memory sub-systems is also increasing, and architectures are becoming significantly more complex. Scientific computing centres need to harness more computational resources in order to facilitate new scientific insights, and maintaining their codebases requires significant investments. Centres therefore have to decide how best to develop their applications to take advantage of future architectures. To prevent vendor "lock-in" and maximise investments, achieving portable performance across multiple architectures is also a significant concern. Efficiently scaling applications will be essential for achieving improvements in science, and the MPI (Message Passing Interface)-only model is reaching its scalability limits. Hybrid approaches which utilise shared-memory programming models are a promising approach for improving scalability. Additionally, PGAS (Partitioned Global Address Space) models have the potential to address productivity and scalability concerns. Furthermore, OpenCL has been developed with the aim of enabling applications to achieve portable performance across a range of heterogeneous architectures. This research examines approaches for achieving greater levels of performance for hydrodynamics applications on future supercomputer architectures. The development of a Lagrangian-Eulerian hydrodynamics application is presented, together with its utility for conducting such research. Strategies for improving application performance, including PGAS- and hybrid-based approaches, are evaluated at large node-counts on several state-of-the-art architectures.
Techniques to maximise the performance and scalability of OpenMP-based hybrid implementations are presented, together with an assessment of how these constructs should be combined with existing approaches. OpenCL is evaluated as an additional technology for implementing a hybrid programming model and improving performance-portability. To enhance productivity, several tools for automatically hybridising applications and improving process-to-topology mappings are evaluated. Power constraints are starting to limit supercomputer deployments, potentially necessitating the use of more energy-efficient technologies. Advanced processor architectures are therefore evaluated as future candidate technologies, together with several application optimisations which will likely be necessary. An FPGA-based solution is examined, including an analysis of how effectively it can be utilised via a high-level programming model, as an alternative to the specialist approaches which currently limit the applicability of this technology.

    Fast and generic concurrent message-passing

    Communication hardware and software have a significant impact on the performance of clusters and supercomputers. The message-passing model, and the Message-Passing Interface (MPI) in particular, has been used with great success as a model of communication in the High-Performance Computing (HPC) community. However, it has recently faced new challenges due to the emergence of many-core architectures and of programming models with dynamic task parallelism, which assume a large number of concurrent, light-weight threads. These applications come from important classes of applications such as graph and data analytics. Using MPI with these languages and runtimes is inefficient because MPI implementations do not perform well with threads. Using MPI as a communication middleware is also inefficient, since MPI has to provide many abstractions that are not needed by many of these frameworks, and thus incurs extra overheads. In this thesis, we studied MPI performance under these new assumptions. We identified several factors in the message-passing model which are inherently problematic for scalability and performance. Next, we analyzed the communication of a number of graph, threading, and data-flow frameworks to identify generic patterns. We then proposed a low-level communication interface (LCI) to bridge the gap between communication architecture and runtime. The core of our idea is to attach to each message a few simple operations which fit better with current hardware and can be implemented efficiently. We show that with only a few carefully chosen primitives and an appropriate design, message-passing under this interface can easily outperform production MPI when running in a multi-threaded environment. Further, LCI is simple to use for various types of usage.
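The central LCI idea, attaching a small completion operation to each message instead of relying on a heavyweight matching engine, can be sketched as a toy model. This is an illustration in Python with hypothetical names; the real interface is a low-level native library:

```python
from queue import Queue

class Endpoint:
    """Toy endpoint: each message carries a small completion operation
    that the receiver runs on arrival, rather than being matched against
    posted receives as in MPI two-sided semantics."""
    def __init__(self):
        self.inbox = Queue()     # thread-safe delivery queue
        self.completed = []      # results of completed operations

    def send(self, dest, payload, on_arrival):
        """Enqueue (payload, operation) at the destination endpoint."""
        dest.inbox.put((payload, on_arrival))

    def progress(self):
        """Poll the inbox once and apply each attached operation."""
        while not self.inbox.empty():
            payload, op = self.inbox.get()
            self.completed.append(op(payload))
```

Because the receiver's work per message is just one small operation, a runtime can keep many threads driving `progress()` concurrently without the tag-matching and wildcard machinery that makes multi-threaded MPI expensive.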

    Towards the use of mini-applications in performance prediction and optimisation of production codes

    Maintaining the performance of large scientific codes is a difficult task. To aid in this task, a number of mini-applications have been developed that are more tractable to analyse than large-scale production codes, while retaining their performance characteristics. These “mini-apps” also enable faster hardware evaluation and, for sensitive commercial codes, allow evaluation of code and system changes outside of access approval processes. Techniques for validating the representativeness of a mini-application to a target code are ultimately qualitative, requiring the researcher to decide whether the similarity is strong enough for the mini-application to be trusted to provide accurate predictions of the target's performance. Little consideration is given to the sensitivity of those predictions to the few differences between the mini-application and its target, or to how those potentially minor static differences may lead to each code responding very differently to a change in the computing environment. An existing mini-application of a production CFD simulation code, ‘Mini-HYDRA’, is reviewed. Arithmetic differences lead to divergence in intra-node performance scaling, so the developers had removed some arithmetic from Mini-HYDRA; but this breaks the simulation and thus limits numerical research. This work restores the arithmetic and repeats the validation, achieving similar intra-node scaling performance whilst neither code is memory-bound. MPI strong-scaling functionality is also added, achieving very similar multi-node scaling performance. The arithmetic restoration inevitably leads to different memory-bounds, and also to different and varied responses to changes in processor architecture or instruction set. A performance model is developed that predicts this difference in response in terms of the arithmetic differences. It is supplemented by a new benchmark that measures the memory-bound of CFD loops.
Together, they predict the strong-scaling performance of a production ‘target’ code with a mean error of 8.8% (s = 5.2%). Finally, the model is used to investigate limited speedup from vectorisation despite the code not being memory-bound. It identifies that instruction throughput is significantly reduced relative to serial counterparts, independent of data ordering in memory, indicating a bottleneck within the processor core.
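The thesis's own model is not reproduced here, but the standard roofline bound it relates to — a loop is memory-bound when its arithmetic intensity lies below the machine's ridge point — can be written down directly:

```python
def attainable_gflops(arith_intensity, peak_gflops, mem_bw_gbs):
    """Classic roofline bound: attainable performance is capped either by
    peak compute or by memory bandwidth times arithmetic intensity
    (flops per byte), whichever is lower."""
    return min(peak_gflops, arith_intensity * mem_bw_gbs)

def is_memory_bound(flops, bytes_moved, peak_gflops, mem_bw_gbs):
    """A kernel is memory-bound when its arithmetic intensity falls below
    the ridge point (peak compute / memory bandwidth)."""
    intensity = flops / bytes_moved
    return intensity < peak_gflops / mem_bw_gbs
```

For example, a kernel doing 0.25 flops per byte on a machine with a ridge point of 2 flops per byte is firmly memory-bound, which is the regime the thesis's CFD-loop benchmark is designed to measure.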

    Performance Analysis of Complex Shared Memory Systems

    Systems for high performance computing are getting increasingly complex. On the one hand, the number of processors is increasing. On the other hand, the individual processors are getting more and more powerful. In recent years, the latter is to a large extent achieved by increasing the number of cores per processor. Unfortunately, scientific applications often fail to fully utilize the available computational performance. Therefore, performance analysis tools that help to localize and fix performance problems are indispensable. Large scale systems for high performance computing typically consist of multiple compute nodes that are connected via network. Performance analysis tools that analyze performance problems that arise from using multiple nodes are readily available. However, the increasing number of cores per processor that can be observed within the last decade represents a major change in the node architecture. Therefore, this work concentrates on the analysis of the node performance. The goal of this thesis is to improve the understanding of the achieved application performance on existing hardware. It can be observed that the scaling of parallel applications on multi-core processors differs significantly from the scaling on multiple processors. Therefore, the properties of shared resources in contemporary multi-core processors as well as remote accesses in multi-processor systems are investigated and their respective impact on the application performance is analyzed. As a first step, a comprehensive suite of highly optimized micro-benchmarks is developed. These benchmarks are able to determine the performance of memory accesses depending on the location and coherence state of the data. They are used to perform an in-depth analysis of the characteristics of memory accesses in contemporary multi-processor systems, which identifies potential bottlenecks. 
However, in order to localize performance problems, it also has to be determined to what extent the application performance is limited by certain resources. Therefore, a methodology to derive metrics for the utilization of individual components in the memory hierarchy, as well as for waiting times caused by memory accesses, is developed in the second step. The approach is based on hardware performance counters, which record the number of certain hardware events. The developed micro-benchmarks are used to selectively stress individual components, which makes it possible to identify the events that provide a reasonable assessment of the utilization of the respective component and of the amount of time that is spent waiting for memory accesses to complete. Finally, the knowledge gained from this process is used to implement a visualization of memory-related performance issues in existing performance analysis tools. The results of the micro-benchmarks reveal that the increasing number of cores per processor and the usage of multiple processors per node lead to complex systems with vastly different performance characteristics of memory accesses, depending on the location of the accessed data. Furthermore, it can be observed that the aggregated throughput of shared resources in multi-core processors does not necessarily scale linearly with the number of cores that access them concurrently, which limits the scalability of parallel applications. It is shown that the proposed methodology for the identification of meaningful hardware performance counters yields useful metrics for the localization of memory-related performance limitations.
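The counter-selection step can be pictured as follows: stress one component with a micro-benchmark and keep the event whose count responds most strongly relative to an idle run, then express utilization against the peak rate that same micro-benchmark achieves. This is a simplified, hypothetical sketch of the methodology, not the thesis's actual implementation:

```python
def select_counter(stressed, idle):
    """Given event counts from a run that saturates one component and
    from an idle run, pick the event that responds most strongly: a
    plausible candidate metric for that component's utilization."""
    best, best_ratio = None, 0.0
    for event in stressed:
        ratio = stressed[event] / (idle[event] + 1)  # +1 avoids div-by-zero
        if ratio > best_ratio:
            best, best_ratio = event, ratio
    return best

def utilization(observed_rate, peak_rate):
    """Utilization of a component relative to the peak event rate
    measured by the micro-benchmark that saturates it."""
    return min(observed_rate / peak_rate, 1.0)
```

In practice the candidate events would come from the processor's documented performance-monitoring events, and the stressed/idle counts from repeated, pinned micro-benchmark runs.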

    Understanding and Guiding the Management of Compute Resources in a Multi-Programming-Model Context

    With the advent of multicore and manycore processors as building blocks of HPC supercomputers, many applications shift from relying solely on a distributed-memory programming model (e.g., MPI) to mixing distributed and shared-memory models (e.g., MPI+OpenMP). This leads to a better exploitation of shared-memory communications and reduces the overall memory footprint. However, this evolution has a large impact on the software stack, as application developers typically mix several programming models to scale over a large number of multicore nodes while coping with their hierarchical depth. One side effect of this programming approach is runtime stacking: mixing multiple models involves multiple runtime libraries being alive at the same time. Dealing with different runtime systems may lead to a large number of execution flows that do not efficiently exploit the underlying resources. We first present a study of runtime stacking. It introduces stacking configurations and categories to describe how stacking can appear in applications. We explore runtime-stacking configurations (spatial and temporal), focusing on thread/process placement on hardware resources from different runtime libraries. We build this taxonomy based on an analysis of state-of-the-art runtime stacking and programming models. We then propose algorithms to detect the misuse of compute resources when running a hybrid parallel application. We have implemented these algorithms inside a dynamic tool, called the Overseer. This tool monitors applications and reports resource usage to the user along the application timeline, focusing on overloading and underloading of compute resources. Finally, we propose a second, external tool called Overmind, which monitors thread/process management and (re)maps threads and processes to the underlying cores, taking into account the hardware topology and the application behavior.
By capturing a global view of resource usage, the Overmind adapts the process/thread placement and aims at making the best decisions to enhance the use of each compute node inside a supercomputer. We demonstrate the relevance of our approach and show that our low-overhead implementation is able to achieve good performance even when running with configurations that would otherwise have ended up with bad resource usage.
Numerical simulation reproduces the physical behaviours that can be observed in nature. It is used to model complex phenomena that are impossible to predict or replicate. To solve these problems in a reasonable time, we turn to High-Performance Computing (HPC). HPC covers all the techniques used to design and operate supercomputers, enormous machines whose goal is to compute ever faster, more precisely, and more efficiently. To reach these goals, the machines are becoming increasingly complex. The current trend is to increase the number of compute cores per processor, but also the number of processors per machine. Machines are becoming more and more heterogeneous, with many different elements to be used at the same time to extract maximum performance. To cope with these difficulties, developers use programming models, whose purpose is to simplify the use of all these resources. Some models, called distributed-memory models (such as MPI), abstract the sending of messages between compute nodes; others, called shared-memory models, simplify and optimise the use of shared memory among compute cores. However, this evolution and growing complexity of supercomputers has a large impact on the software stack.
It is now necessary to use several programming models at the same time within applications. This affects not only the development of simulation codes, since developers must handle several models at once, but also the execution of the simulations. One side effect of this programming approach is runtime stacking: mixing several models implies that several libraries are active at the same time. Managing several libraries can lead to a large number of execution flows using the underlying resources sub-optimally. The goal of this thesis is to study the stacking of programming models and to optimise the use of compute resources by these models during the execution of numerical simulations. We first characterised the different ways of building compute codes that mix several models. We also studied the interactions these models can have with each other during the execution of simulations. From these observations, we designed algorithms to detect sub-optimal resource usage. Finally, we developed a tool that automatically steers the use of resources by the different programming models.
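The overload/underload detection that the Overseer performs can be illustrated with a toy occupancy check over a thread-to-core mapping (hypothetical names; not the tool's actual code):

```python
from collections import Counter

def classify_cores(thread_to_core, n_cores):
    """Flag cores running more than one thread (overloaded, i.e.
    oversubscribed by stacked runtimes) and cores running none
    (underloaded, i.e. wasted), as a simplified Overseer-style check."""
    occupancy = Counter(thread_to_core.values())
    overloaded = sorted(c for c, n in occupancy.items() if n > 1)
    underloaded = sorted(c for c in range(n_cores) if occupancy[c] == 0)
    return overloaded, underloaded
```

The real tools track this along the application timeline and against the hardware topology; the sketch shows only a single snapshot, which is already enough to reveal the characteristic symptom of runtime stacking: two libraries pinning their workers to the same cores while other cores sit idle.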