Search CORE

166 research outputs found

Exploring Application Performance on Emerging Hybrid-Memory Supercomputers

Author: Gioiosa Roberto
Kestor Gokcen
Laure Erwin
Markidis Stefano
Peng Ivy Bo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 26/04/2017
Field of study

Next-generation supercomputers will feature more hierarchical and heterogeneous memory systems with different memory technologies working side-by-side. A critical question is whether at large scale existing HPC applications and emerging data-analytics workloads will have performance improvement or degradation on these systems. We propose a systematic and fair methodology to identify the trend of application performance on emerging hybrid-memory systems. We model the memory system of next-generation supercomputers as a combination of "fast" and "slow" memories. We then analyze performance and dynamic execution characteristics of a variety of workloads, from traditional scientific applications to emerging data analytics to compare traditional and hybrid-memory systems. Our results show that data analytics applications can clearly benefit from the new system design, especially at large scale. Moreover, hybrid-memory systems do not penalize traditional scientific applications, which may also show performance improvement.Comment: 18th International Conference on High Performance Computing and Communications, IEEE, 201

arXiv.org e-Print Archive

Crossref

Automating the application data placement in hybrid memory systems

Author: Hoppe Hans-Christian
Labarta Mancho Jesús José
Llort German
Mercadal Estanislao
Peña Antonio J.
Servat Harald
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017
Field of study

Multi-tiered memory systems, such as those based on Intel® Xeon Phi™processors, are equipped with several memory tiers with different characteristics including, among others, capacity, access latency, bandwidth, energy consumption, and volatility. The proper distribution of the application data objects into the available memory layers is key to shorten the time– to–solution, but the way developers and end-users determine the most appropriate memory tier to place the application data objects has not been properly addressed to date.In this paper we present a novel methodology to build an extensible framework to automatically identify and place the application’s most relevant memory objects into the Intel Xeon Phi fast on-package memory. Our proposal works on top of inproduction binaries by first exploring the application behavior and then substituting the dynamic memory allocations. This makes this proposal valuable even for end-users who do not have the possibility of modifying the application source code. We demonstrate the value of a framework based in our methodology for several relevant HPC applications using different allocation strategies to help end-users improve performance with minimal intervention. The results of our evaluation reveal that our proposal is able to identify the key objects to be promoted into fast on-package memory in order to optimize performance, leading to even surpassing hardware-based solutions.This work has been performed in the Intel-BSC Exascale Lab. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. We would like to thank the Intel’s DCG HEAT team for allowing us to access their computational resources. We also want to acknowledge this team, especially Larry Meadows and Jason Sewall, as well as Pardo Keppel for the productive discussions. We thank Raphaël Léger for allowing us to access the MAXW-DGTD application and its input.Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC

Hybrid2: Combining Caching and Migration in Hybrid Memory Systems

Author: Papaefstathiou Vasileios
Petersen Moura Trancoso Pedro
Sourdis Ioannis
Vasilakis Evangelos
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020
Field of study

This paper considers a hybrid memory system composed of memory technologies with different characteristics; in particular a small, near memory exhibiting high bandwidth, i.e., 3D-stacked DRAM, and a larger, far memory offering capacity at lower bandwidth, i.e., off-chip DRAM. In the past,the near memory of such a system has been used either as a DRAM cache or as part of a flat address space combined with a migration mechanism. Caches and migration offer different tradeoffs (between performance, main memory capacity, data transfer costs, etc.) and share similar challenges related todata-transfer granularity and metadata management. This paper proposes Hybrid2 , a new hybrid memory system architecture that combines a DRAM cache with a migration scheme. Hybrid 2 does not deny valuable capacity from the memory system because it uses only a small fraction of the near memory as a DRAM cache; 64MB in our experiments.It further leverages the DRAM cache as a staging area to select the data most suitable for migration. Finally, Hybrid2 alleviates the metadata overheads of both DRAM caches and migration using a common mechanism. Using near to far memory ratios of 1:16, 1:8 and 1:4 in our experiments, Hybrid2 on average outperforms current state-of-the-art migration schemes by 7.9%, 9.1% and 6.4%, respectively. In the same system configurations, compared to DRAM caches Hybrid2 gives away on average only 0.3%, 1.2%, and 5.3% of performance offering 5.9%, 12.1%, and 24.6% more main memory capacity, respectively

Crossref

Chalmers Research

High-performance dataflow computing in hybrid memory systems with UPC++ DepSpawn

Author: Andrade Diego
Fraguela Basilio B.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2021
Field of study

[Abstract]: Dataflow computing is a very attractive paradigm for high-performance computing, given its ability to trigger computations as soon as their inputs are available. UPC++ DepSpawn is a novel task-based library that supports this model in hybrid shared/distributed memory systems on top of a Partitioned Global Address Space environment. While the initial version of the library provided good results, it suffered from a key restriction that heavily limited its performance and scalability. Namely, each process had to consider all the tasks in the application rather than only those of interest to it, an overhead that naturally grows with both the number of processes and tasks in the system. In this paper, this restriction is lifted, enabling our library to provide higher levels of performance. This way, in experiments using 768 cores the performance improved up to 40.1%, the average improvement being 16.1%.Ministerio de Ciencia e Innovación; TIN2016-75845-PMinisterio de Ciencia e Innovación; PID2019-104184RB-I00Ministerio de Ciencia e Innovación; 10.13039/501100011033Xunta de Galicia; ED431C 2017/0

Repositorio da Universidade da Coruña

ecoHMEM: Improving object placement methodology for hybrid memory systems in HPC

Author: Ayguadé Parra Eduard
Jordà Peroliu Marc
Labarta Mancho Jesús José
Peña Monferrer Antonio José
Rai Siddharth
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2022
Field of study

Recent byte-addressable persistent memory (PMEM) technology offers capacities comparable to storage devices and access times much closer to DRAMs than other non-volatile memory technology. To palliate the large gap with DRAM performance, DRAM and PMEM are usually combined. Users have the choice to either manage the placement to different memory spaces by software or leverage the DRAM as a cache for the virtual address space of the PMEM. We present novel methodology for automatic object-level placement, including efficient runtime object matching and bandwidth-aware placement. Our experiments leveraging Intel® Optane™ Persistent Memory show from matching to greatly improved performance with respect to state-of-the-art software and hardware solutions, attaining over 2x runtime improvement in miniapplications and over 6% in OpenFOAM, a complex production application.This paper received funding from the Intel-BSC Exascale Laboratory SoW 5.1, the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska-Curie grant agreement No. 749516, the EPEEC project from the European Union’s Horizon 2020 research and innovation program under grant agreement No 801051, the DEEP-SEA project from the European Commission’s EuroHPC program under grant agreement 955606, and the Ministerio de Ciencia e Innovacion—Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033).Peer ReviewedPostprint (author's final draft

UPCommons. Portal del coneixement obert de la UPC