7 research outputs found
Design and Implementation of Bandwidth-aware Memory Placement and Migration Policies for Heterogeneous Memory Systems
Department of Computer Science and Engineering
Heterogeneous memory systems are composed of several types of memory, and are used in various
computing domains. Each memory node in a heterogeneous memory system has different characteristics
and performance; particularly significant differences lie in access latency and memory
bandwidth. This heterogeneity between memories must therefore be considered to fully exploit the
performance of a heterogeneous memory system. However, most previous work has not considered the
bandwidth differences between the memory nodes constituting a heterogeneous memory system.
The present work proposes bandwidth-aware memory placement and migration policies to solve the
problem caused by the bandwidth difference of the memory nodes in a heterogeneous memory system.
We implement three bandwidth-aware memory placement policies and one bandwidth-aware migration
policy in the Linux kernel, and quantitatively evaluate them on real systems. In
addition, we show that our proposed bandwidth-aware memory placement and migration policies
achieve higher performance than conventional memory placement and migration policies that
do not consider the bandwidth differences between heterogeneous memory nodes.
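The idea of bandwidth-aware placement can be illustrated with a minimal sketch: instead of placing pages only on the fastest node, pages are distributed across nodes in proportion to each node's bandwidth. This is not the thesis's actual kernel implementation; the node names and the 3:1 bandwidth ratio are illustrative assumptions.

```python
# Hypothetical sketch of bandwidth-proportional page placement across
# heterogeneous memory nodes. Node names and bandwidth weights are
# illustrative assumptions, not values from the thesis.

def bandwidth_weighted_placement(num_pages, nodes):
    """Distribute pages across nodes in proportion to node bandwidth,
    approximating a bandwidth-aware interleaving policy.

    nodes: list of (name, bandwidth_weight) tuples.
    """
    placement = {name: 0 for name, _ in nodes}
    # Build a weighted schedule so higher-bandwidth nodes absorb
    # proportionally more pages.
    schedule = []
    for name, bw in nodes:
        schedule.extend([name] * bw)
    for page in range(num_pages):
        placement[schedule[page % len(schedule)]] += 1
    return placement

# Assumed 3:1 bandwidth ratio between a fast and a slow node.
nodes = [("fast_node", 3), ("slow_node", 1)]
print(bandwidth_weighted_placement(8, nodes))
```

With the 3:1 weights above, six of every eight pages land on the fast node, so both nodes' bandwidth is exercised rather than saturating only the fast one.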
Object Placement Simulation on Heterogeneous Memory Systems Using Context-Aware Object Profiling Information
Master's thesis -- Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2018. Advisor: Heon Y. Yeom.
Phase change memory (PCM) is one of the promising non-volatile memory (NVM) technologies, since it provides both high capacity and low idle power consumption. However, its relatively slow access latency is one of the major challenges in using PCM as main memory. Therefore, recent research has attempted to construct heterogeneous memory systems by combining such NVM with
DRAM. One of the major problems with using those systems is placing data in the appropriate type of memory. In this paper, we propose an object placement method to address the data placement problem in heterogeneous memory
systems. With context-aware object profile information, we can dynamically detect the memory access patterns of objects and determine the proper memory to place each object on. We demonstrate the effectiveness of the proposed method by simulating memory access latency and energy consumption using four selected workloads from the SPEC benchmark suite.
Chapter 1 Introduction
Chapter 2 Background and Motivation
2.1 Heterogeneous Memory Systems
2.2 Context-Aware Memory Profiling
2.3 Object Profiling and Placement
Chapter 3 Object Placement Modeling
3.1 Basic Assumptions
3.2 Latency Modeling
3.3 Energy Consumption Modeling
3.4 Idle Power Consumption Modeling
3.5 Object Placement Decision
Chapter 4 Simulation
4.1 Simulation Methodology
4.2 Program Profiling Results
4.3 Simulation of Latency
4.4 Simulation of Energy Consumption
4.5 Simulation of Idle Power Consumption
Chapter 5 Conclusion
Bibliography
Abstract (in Korean)
Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems
Current HPC systems provide memory resources that are statically configured
and tightly coupled with compute nodes. However, workloads on HPC systems are
evolving. Diverse workloads lead to a need for configurable memory resources to
achieve high performance and utilization. In this study, we evaluate a memory
subsystem design leveraging CXL-enabled memory pooling. Two promising use cases
of composable memory subsystems are studied -- fine-grained capacity
provisioning and scalable bandwidth provisioning. We developed an emulator to
explore the performance impact of various memory compositions. We also provide
a profiler to identify the memory usage patterns in applications and their
optimization opportunities. Seven scientific and six graph applications are
evaluated on various emulated memory configurations. Three out of seven
scientific applications had less than 10% performance impact when the pooled
memory backed 75% of their memory footprint. The results also show that a
dynamically configured high-bandwidth system can effectively support
bandwidth-intensive unstructured mesh-based applications like OpenFOAM.
Finally, we identify interference through shared memory pools as a practical
challenge for adoption on HPC systems.
Comment: 10 pages, 13 figures. Accepted for publication in the Workshop on Memory
Centric High Performance Computing (MCHPC'22) at SC22
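The capacity-provisioning result (some applications tolerate having most of their footprint backed by pooled memory) can be illustrated with a first-order slowdown model: only the memory-bound share of runtime stretches, in proportion to the accesses served remotely. The latency ratio and memory-bound share below are assumed parameters, not measurements from the paper.

```python
# Illustrative first-order model of runtime stretch when a fraction of an
# application's footprint is served from slower pooled (CXL-attached)
# memory. Parameters are assumptions, not values from the paper.

def estimated_slowdown(pooled_fraction, remote_latency_ratio, mem_bound_share):
    """Only the memory-bound share of runtime is affected; it stretches
    in proportion to the accesses served from the remote pool."""
    stretched = mem_bound_share * ((1 - pooled_fraction)
                                   + pooled_fraction * remote_latency_ratio)
    return (1 - mem_bound_share) + stretched

# Example: 75% of the footprint pooled, 2x remote latency, and a workload
# that spends 20% of its time memory-bound.
print(round(estimated_slowdown(0.75, 2.0, 0.2), 3))
```

Under these assumed numbers the model predicts roughly a 15% slowdown, which is consistent in spirit with the paper's observation that compute-bound applications can tolerate large pooled fractions cheaply.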
Distributed shared memory on heterogeneous CPUs+GPUs platforms
Master's dissertation in Informatics Engineering (Dissertação de mestrado em Engenharia Informática).
Developing for heterogeneous platforms can significantly complicate the coding process, since different processing devices mean different architectures, programming and memory models, disjoint address spaces, and so on. This document proposes that the development process can be eased by virtualizing a traditional shared memory environment on top of the heterogeneous distributed system and exposing a unified memory model to the developer. The memory system frees the developer from having to manually manage data movements and allows the use of dynamic memory, accessible by all the devices.
The proposed memory system was implemented and validated on the GAMA framework using three algorithms to benchmark the system: SAXPY, all-pairs N-Body simulation, and Barnes-Hut N-Body simulation. These algorithms were used to evaluate the framework's performance and scalability when paired with the proposed memory system.
The results show that, overall, the memory system improved performance on all algorithms. The memory system proved most useful on algorithms with a high ratio of computation over memory accesses, improving execution times, and especially useful on irregular algorithms, where it also improved scalability. The parallel memory allocator showed great results when used only on the CPU, but suffered allocation-speed problems when GPUs were added to the system.
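The unified-memory idea described above can be sketched as a bookkeeping layer that tracks which device holds the valid copy of each block and migrates it on demand, so the programmer never issues explicit transfers. This is not the GAMA implementation; the class and device names are hypothetical.

```python
# Hypothetical sketch of a unified memory model over disjoint address
# spaces: track the device holding each block's valid copy and migrate
# on access. Device names and the API are assumptions for illustration.

class UnifiedMemory:
    def __init__(self):
        self.owner = {}  # block id -> device currently holding the valid copy

    def alloc(self, block):
        # Newly allocated blocks start on the host.
        self.owner[block] = "cpu"

    def access(self, block, device):
        """Migrate the valid copy if a different device touches the block."""
        if self.owner[block] != device:
            # In a real system this would be a DMA copy between the
            # disjoint CPU and GPU address spaces.
            self.owner[block] = device
        return self.owner[block]

mem = UnifiedMemory()
mem.alloc("x")
mem.access("x", "gpu0")  # implicit host-to-device migration
```

The point of the sketch is the programming-model contrast: the application simply calls `access`, and data movement is a hidden consequence rather than explicit code.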
Architecting heterogeneous memory systems with 3D die-stacked memory
The main objective of this research is to efficiently enable 3D die-stacked memory and heterogeneous memory systems. 3D die-stacking is an emerging technology that allows for large amounts of in-package high-bandwidth memory storage. Die-stacked memory has the potential to provide extraordinary performance and energy benefits for computing environments, from data-intensive to mobile computing. However, incorporating die-stacked memory into computing environments requires innovations across the system stack, from hardware to software. This dissertation presents several architectural innovations to practically deploy die-stacked memory into a variety of computing systems.
First, this dissertation proposes using die-stacked DRAM as a hardware-managed cache in a practical and efficient way. The proposed DRAM cache architecture employs two novel techniques: hit-miss speculation and self-balancing dispatch. The proposed techniques virtually eliminate the hardware overhead of maintaining a multi-megabyte SRAM structure when scaling to gigabytes of stacked DRAM cache, and improve overall memory bandwidth utilization.
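Hit-miss speculation can be illustrated with a tiny predictor sketch: a table of saturating counters, indexed by address region, guesses whether a request will hit the stacked-DRAM cache so the request can be routed before the tag lookup completes. The table size and thresholds here are assumptions, not the dissertation's design parameters.

```python
# Illustrative sketch of hit-miss speculation for a stacked-DRAM cache:
# small per-region 2-bit saturating counters predict hit or miss so a
# request can be dispatched early. Sizes/thresholds are assumptions.

class HitPredictor:
    def __init__(self, regions=256):
        self.counter = [2] * regions  # start weakly predicting "hit"

    def predict(self, addr):
        # Counter values 2-3 predict hit; 0-1 predict miss.
        return self.counter[addr % len(self.counter)] >= 2

    def update(self, addr, was_hit):
        """Train the counter with the actual tag-lookup outcome."""
        i = addr % len(self.counter)
        if was_hit:
            self.counter[i] = min(3, self.counter[i] + 1)
        else:
            self.counter[i] = max(0, self.counter[i] - 1)
```

A mispredicted "hit" costs a wasted stacked-DRAM access, while a correct prediction hides the tag-lookup latency, which is why high-accuracy speculation pays off at gigabyte cache scales.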
Second, this dissertation proposes a DRAM cache organization that provides a high level of reliability for die-stacked DRAM caches in a cost-effective manner. The proposed DRAM cache uses error-correcting codes (ECCs), strong checksums (CRCs), and dirty data duplication to detect and correct a wide range of stacked DRAM failures, from traditional bit errors to large-scale row, column, bank, and channel failures, within the constraints of commodity, non-ECC DRAM stacks. With only a modest performance degradation compared to a DRAM cache with no ECC support, the proposed organization can correct all single-bit failures and 99.9993% of all row, column, and bank failures.
Third, this dissertation proposes architectural mechanisms to use large, fast, on-chip memory structures as part of memory (PoM) seamlessly through the hardware. The proposed design achieves the performance benefit of on-chip memory caches without sacrificing a large fraction of total memory capacity to serve as a cache. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits.
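The PoM remapping mechanism can be sketched as epoch-based profiling: count accesses per region, and at the end of each epoch remap the hottest regions into the fast on-package memory, so its capacity counts toward total memory instead of being a cache. The epoch structure and slot count below are assumptions for illustration.

```python
# Hypothetical sketch of Part-of-Memory (PoM) style dynamic remapping:
# fast on-package memory holds the most-accessed regions, chosen by
# per-epoch access counts. Epoch length and slot count are assumptions.

class PoMRemapper:
    def __init__(self, fast_slots):
        self.fast_slots = fast_slots        # regions that fit in fast memory
        self.counts = {}                    # region -> accesses this epoch
        self.fast = set(range(fast_slots))  # initially identity-mapped

    def access(self, region):
        self.counts[region] = self.counts.get(region, 0) + 1

    def epoch_end(self):
        """Remap so fast memory holds the hottest regions, then reset."""
        hottest = sorted(self.counts, key=self.counts.get, reverse=True)
        self.fast = set(hottest[:self.fast_slots])
        self.counts.clear()
        return self.fast
```

Unlike a cache, every region always has exactly one home, so no capacity is sacrificed to duplication; the cost is that remapping decisions are made at coarse epoch granularity.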
Lastly, this dissertation explores a new usage model for die-stacked DRAM involving a hybrid of caching and virtual memory support. In the common case, where the system's physical memory is not over-committed, die-stacked DRAM operates as a cache to provide performance and energy benefits to the system. However, when the workload's active memory demand exceeds the capacity of the physical memory, the proposed scheme dynamically converts the stacked DRAM cache into a fast swap device to avoid the otherwise grievous performance penalty of swapping to disk.
Energy-efficient architectures for chip-scale networks and memory systems using silicon-photonics technology
Today's supercomputers and cloud systems run many data-centric applications such as machine learning, graph algorithms, and cognitive processing, which have large data footprints and complex data access patterns. With the energy efficiency of large-scale systems projected to reach 50 GFLOPS/W, the target energy-per-bit budget for data movement is expected to drop as low as 0.1 pJ/bit, assuming 200 bits/FLOP for data transfers. This tight energy budget impacts the design of both chip-scale networks and main memory systems. Conventional electrical links used in chip-scale networks (0.5-3 pJ/bit) and DRAM systems used in main memory (>30 pJ/bit) fail to provide sustained performance within such low energy budgets. This thesis builds on the promising research on silicon-photonic technology to design system architectures and system management policies for chip-scale networks and main memory systems. The adoption of silicon-photonic links in chip-scale networks, however, is hampered by the high sensitivity of optical devices to thermal and process variations. These device sensitivities result in high power overheads for high-speed communication. Moreover, applications differ in their resource utilization, resulting in application-specific thermal profiles and bandwidth needs. Similarly, optically-controlled memory systems designed using conventional electrical-based architectures require additional circuitry for electrical-to-optical and optical-to-electrical conversions within the memory. These conversions increase the energy and latency of each memory access. Due to these issues, chip-scale networks and memory systems designed using silicon-photonic technology leave much of their potential benefit unrealized.
This thesis argues for the need to rearchitect memory systems and redesign network management policies so that they are aware of application variability and of the underlying device characteristics of silicon-photonic technology. We claim that such a cross-layer design enables a high-throughput, energy-efficient unified silicon-photonic link and main memory system. This thesis undertakes the cross-layer design with silicon-photonic technology on two fronts. First, we study the varying network bandwidth requirements across different applications and also within a given application. To address this variability, we develop bandwidth allocation policies that account for application needs and device sensitivities to ensure power-efficient operation of silicon-photonic links. Second, we design a novel architecture for an optically-controlled main memory system that is directly interfaced with silicon-photonic links through a novel read and write access protocol. Such a system ensures low-energy, high-throughput access from the processor to a high-density memory. To further address the diversity in application memory characteristics, we explore heterogeneous memory systems with multiple memory modules that provide varied power-performance trade-offs. We design a memory management policy for such systems that allocates pages at the granularity of memory objects within an application.
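A demand-driven bandwidth allocation policy of the kind described above can be sketched as proportional sharing: divide the limited pool of photonic wavelengths among applications according to their measured bandwidth demand, letting lightly loaded links power down lasers. The function name, units, and rounding rule are assumptions, not the thesis's actual policy.

```python
# Illustrative sketch of demand-proportional wavelength allocation for
# silicon-photonic links. Names, units, and the remainder rule are
# assumptions made for this example.

def allocate_wavelengths(demands, total_wavelengths):
    """demands: dict app -> measured bandwidth demand (arbitrary units).
    Returns an integer wavelength count per application."""
    total = sum(demands.values())
    alloc = {app: (d * total_wavelengths) // total
             for app, d in demands.items()}
    # Hand leftover wavelengths (from integer rounding) to the most
    # demanding applications first.
    leftover = total_wavelengths - sum(alloc.values())
    for app in sorted(demands, key=demands.get, reverse=True)[:leftover]:
        alloc[app] += 1
    return alloc

print(allocate_wavelengths({"app_a": 3, "app_b": 1}, 8))
```

Because unassigned wavelengths correspond to lasers that can be turned off or reassigned, demand-proportional allocation directly connects the application-level variability discussed above to link power efficiency.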