6 research outputs found

    Automating the application data placement in hybrid memory systems

    Multi-tiered memory systems, such as those based on Intel® Xeon Phi™ processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects across the available memory layers is key to shortening the time-to-solution, but the way developers and end-users determine the most appropriate memory tier in which to place the application data objects has not been properly addressed to date. In this paper we present a novel methodology to build an extensible framework that automatically identifies and places the application's most relevant memory objects into the Intel Xeon Phi fast on-package memory. Our proposal works on top of in-production binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes the proposal valuable even for end-users who cannot modify the application source code. We demonstrate the value of a framework based on our methodology for several relevant HPC applications, using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal identifies the key objects to be promoted into fast on-package memory in order to optimize performance, even surpassing hardware-based solutions. This work has been performed in the Intel-BSC Exascale Lab. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. We would like to thank Intel's DCG HEAT team for allowing us to access their computational resources. We also want to acknowledge this team, especially Larry Meadows and Jason Sewall, as well as Pardo Keppel, for the productive discussions. We thank Raphaël Léger for allowing us to access the MAXW-DGTD application and its input. Peer Reviewed. Postprint (author's final draft).
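    The core idea described above (profile first, then reroute selected dynamic allocations to the fast tier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the call-site ids, the hot-site table, and the capacity budget are hypothetical, and a real tool would direct hot allocations to an actual fast-memory allocator such as memkind's hbw_malloc rather than track a counter.

    ```c
    /* Sketch of profile-guided allocation routing. The hot-site list
     * stands in for the result of a prior profiling run. */
    #include <stdlib.h>

    #define FAST_CAPACITY (16UL * 1024 * 1024 * 1024) /* illustrative 16 GiB budget */

    static size_t fast_used = 0;

    /* Allocation call sites the (hypothetical) profile marked as hot. */
    static int is_hot_site(int site_id) {
        static const int hot[] = { 3, 7, 12 };
        for (size_t i = 0; i < sizeof hot / sizeof *hot; i++)
            if (hot[i] == site_id) return 1;
        return 0;
    }

    /* Replacement for malloc at instrumented call sites: hot objects go
     * to the fast tier while budget remains; everything else, and any
     * overflow, falls back to ordinary DRAM. In this sketch both tiers
     * are served by malloc and only the decision is modeled. */
    void *place_alloc(int site_id, size_t size, int *went_fast) {
        if (is_hot_site(site_id) && fast_used + size <= FAST_CAPACITY) {
            fast_used += size;
            *went_fast = 1;
        } else {
            *went_fast = 0;
        }
        return malloc(size);
    }
    ```

    In the paper's setting the substitution happens on the binary, so unmodified applications pick up the routing transparently; the sketch only models the placement decision itself.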

    Exposing the characteristics of heterogeneous memory architectures to parallel applications

    The complexity of memory systems has increased considerably over the past decade. Consequently, supercomputers include memories at several levels, heterogeneous and non-uniform, with significantly different properties. Developers of scientific applications face a huge challenge: harnessing the memory system efficiently to improve performance and productivity. In this work, we present an interface to manage the complexity of the memory system, composed of a set of memory attributes and an API to express and manage these various characteristics using metrics, for example bandwidth, latency, and capacity. It allows runtime systems, parallel libraries, and scientific applications to select the appropriate memory by expressing their needs for each allocation without having to modify the code for each platform.
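    An attribute-driven selection interface of the kind the abstract describes might look like the following sketch. The tier table, struct fields, and function names are hypothetical illustrations of the concept, not the paper's actual API (hwloc's memory-attributes interface is a real example of the same idea).

    ```c
    /* Sketch: pick a memory tier by expressing a per-allocation goal
     * (maximize bandwidth or minimize latency) against tier attributes. */
    #include <stddef.h>

    typedef struct {
        const char *name;
        double bandwidth_gbs;   /* GB/s */
        double latency_ns;      /* ns   */
        size_t capacity_bytes;
    } mem_tier;

    enum mem_goal { GOAL_BANDWIDTH, GOAL_LATENCY };

    /* Return the index of the best tier with enough capacity for the
     * request, or -1 if no tier fits. */
    int select_tier(const mem_tier *tiers, int n, size_t need, enum mem_goal goal) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (tiers[i].capacity_bytes < need) continue;
            if (best < 0 ||
                (goal == GOAL_BANDWIDTH && tiers[i].bandwidth_gbs > tiers[best].bandwidth_gbs) ||
                (goal == GOAL_LATENCY   && tiers[i].latency_ns    < tiers[best].latency_ns))
                best = i;
        }
        return best;
    }
    ```

    The point of such an interface is that the caller states a metric per allocation, and the same code selects DRAM, HBM, or PMEM correctly on any platform whose tiers are described in the table.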

    ecoHMEM: Improving object placement methodology for hybrid memory systems in HPC

    Recent byte-addressable persistent memory (PMEM) technology offers capacities comparable to storage devices and access times much closer to DRAM than other non-volatile memory technologies. To palliate the large performance gap with DRAM, DRAM and PMEM are usually combined. Users can either manage the placement into the different memory spaces in software or leverage the DRAM as a cache for the virtual address space of the PMEM. We present a novel methodology for automatic object-level placement, including efficient runtime object matching and bandwidth-aware placement. Our experiments leveraging Intel® Optane™ Persistent Memory show performance ranging from matching to greatly exceeding that of state-of-the-art software and hardware solutions, attaining over 2x runtime improvement in mini-applications and over 6% in OpenFOAM, a complex production application. This paper received funding from the Intel-BSC Exascale Laboratory SoW 5.1, the European Union's Horizon 2020 research and innovation program under Marie Sklodowska-Curie grant agreement No. 749516, the EPEEC project from the European Union's Horizon 2020 research and innovation program under grant agreement No. 801051, the DEEP-SEA project from the European Commission's EuroHPC program under grant agreement 955606, and the Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033). Peer Reviewed. Postprint (author's final draft).
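    One simple way to realize bandwidth-aware object placement, in the spirit of the methodology above, is a greedy knapsack over profiled access density. The numbers and the greedy policy below are illustrative assumptions, not ecoHMEM's exact algorithm.

    ```c
    /* Sketch: rank objects by profiled accesses per byte and fill the
     * DRAM budget greedily; everything that does not fit stays in PMEM. */
    #include <stdlib.h>

    typedef struct {
        size_t size;
        unsigned long accesses;  /* from a hypothetical profiling run */
        int in_dram;             /* output: 1 = DRAM, 0 = PMEM */
    } object;

    /* Descending order of access density (accesses per byte). */
    static int by_density_desc(const void *a, const void *b) {
        const object *x = a, *y = b;
        double dx = (double)x->accesses / x->size;
        double dy = (double)y->accesses / y->size;
        return (dx < dy) - (dx > dy);
    }

    /* Sorts objs in place, then assigns the hottest bytes to DRAM
     * until the budget runs out. */
    void place_objects(object *objs, int n, size_t dram_budget) {
        qsort(objs, n, sizeof *objs, by_density_desc);
        size_t used = 0;
        for (int i = 0; i < n; i++) {
            objs[i].in_dram = (used + objs[i].size <= dram_budget);
            if (objs[i].in_dram) used += objs[i].size;
        }
    }
    ```

    The appeal of object-level placement over a hardware DRAM cache is that a few hot objects typically account for most of the bandwidth demand, so a static decision made once per object can beat reactive caching.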

    Runtime-guided management of stacked DRAM memories in task parallel programs

    Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without any complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization to parallelize the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks, and can exploit this information to avoid unprofitable copies of data to the stacked DRAM. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% against the state-of-the-art library for managing the stacked DRAM and 29% against a stacked DRAM architected as a hardware cache. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme (grant agreement 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Peer Reviewed. Postprint (author's final draft).
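    The reuse-aware decision of whether a copy into stacked DRAM is worthwhile can be modeled with a simple cost comparison: copy only when the expected reuses amortize the copy cost. This is a minimal sketch of that reasoning, with illustrative bandwidth figures; it is not the paper's runtime heuristic.

    ```c
    /* Sketch: decide whether copying a task's data into stacked DRAM
     * pays off, comparing one-time copy cost against per-reuse savings. */
    #include <stddef.h>

    /* fast_bw / slow_bw are sustained bandwidths in bytes/s (e.g. MCDRAM
     * vs. off-chip DDR). Returns 1 if the copy is expected to pay off. */
    int worth_copying(size_t bytes, int expected_reuses,
                      double fast_bw, double slow_bw) {
        /* One-time cost: read from slow memory plus write into fast memory. */
        double copy_cost = bytes / slow_bw + bytes / fast_bw;
        /* Per-reuse saving: each access reads fast memory instead of slow. */
        double saving_per_use = bytes / slow_bw - bytes / fast_bw;
        return expected_reuses * saving_per_use > copy_cost;
    }
    ```

    A runtime system sees task dependencies ahead of execution, so it can estimate `expected_reuses` from the task graph; data touched only once is left in off-chip memory, which is exactly the class of unprofitable copies the proposal avoids.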
