
    Design and Implementation of Bandwidth-aware Memory Placement and Migration Policies for Heterogeneous Memory Systems

    Department of Computer Science and Engineering
    Heterogeneous memory systems are composed of several types of memory and are used in various computing domains. Each memory node in a heterogeneous memory system has different characteristics and performance; a particularly significant difference lies in access latency and memory bandwidth. This heterogeneity must therefore be considered to fully exploit the performance of a heterogeneous memory system. However, most previous work did not consider the bandwidth differences between the memory nodes that constitute a heterogeneous memory system. The present work proposes bandwidth-aware memory placement and migration policies to solve the problems caused by the bandwidth differences between memory nodes in a heterogeneous memory system. We implement three bandwidth-aware memory placement policies and one bandwidth-aware migration policy in the Linux kernel, then quantitatively evaluate them on real systems. In addition, we show that our proposed bandwidth-aware memory placement and migration policies achieve higher performance than conventional placement and migration policies that do not consider the bandwidth differences between heterogeneous memory nodes.
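    To make the core idea concrete, here is a minimal user-space sketch (mine, not code from the thesis) of bandwidth-proportional page placement: instead of filling the fast node first, newly allocated pages are spread across nodes in proportion to each node's bandwidth. The node count and bandwidth figures are illustrative assumptions.

```c
#include <stdio.h>

/* Hypothetical two-node system: node 0 = fast DRAM, node 1 = slower memory.
 * Bandwidth figures are illustrative, not measurements from the thesis. */
#define NNODES 2
static const double node_bw_gbps[NNODES] = { 80.0, 20.0 };

/* Smooth weighted round-robin: over time, pages land on each node in
 * proportion to its bandwidth (here 4:1). */
static int pick_node(void)
{
    static double credit[NNODES];
    double total = 0.0;
    int best = 0;
    for (int n = 0; n < NNODES; n++) {
        credit[n] += node_bw_gbps[n];
        total += node_bw_gbps[n];
        if (credit[n] > credit[best])
            best = n;
    }
    credit[best] -= total;   /* charge the chosen node for this page */
    return best;
}

int main(void)
{
    int count[NNODES] = { 0 };
    for (int i = 0; i < 100; i++)
        count[pick_node()]++;
    printf("node0: %d pages, node1: %d pages\n", count[0], count[1]);
    return 0;
}
```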

    Object Placement Simulation in Heterogeneous Memory Systems Using Context-Aware Object Profiling Information

    Master's thesis, Seoul National University Graduate School, Department of Computer Science and Engineering, College of Engineering, February 2018. Advisor: Heon Y. Yeom (염헌영).
    Phase change memory (PCM) is one of the promising non-volatile memory (NVM) technologies, since it provides both high capacity and low idle power consumption. However, its relatively long access latency is a major challenge to using PCM as main memory. Recent research has therefore attempted to construct heterogeneous memory systems by combining such NVM with DRAM. A major problem in using these systems is placing data in the appropriate type of memory. In this paper, we propose an object placement method to address the data placement problem in heterogeneous memory systems. With context-aware object profile information, we can dynamically detect the memory access patterns of objects and determine the proper memory to place each object on. We demonstrate the effectiveness of the proposed method by simulating memory access latency and energy consumption using four selected workloads from the SPEC benchmark suite.
    Contents: Chapter 1 Introduction; Chapter 2 Background and Motivation (2.1 Heterogeneous Memory Systems, 2.2 Context-Aware Memory Profiling, 2.3 Object Profiling and Placement); Chapter 3 Object Placement Modeling (3.1 Basic Assumptions, 3.2 Latency Modeling, 3.3 Energy Consumption Modeling, 3.4 Idle Power Consumption Modeling, 3.5 Object Placement Decision); Chapter 4 Simulation (4.1 Simulation Methodology, 4.2 Program Profiling Results, 4.3 Simulation of Latency, 4.4 Simulation of Energy Consumption, 4.5 Simulation of Idle Power Consumption); Chapter 5 Conclusion; Bibliography; Abstract in Korean.
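    As a rough illustration of the kind of placement decision the thesis models (latency modeling feeding a placement decision), the sketch below ranks objects by how much they would suffer in PCM and greedily gives scarce DRAM to the worst sufferers. The latency constants and the greedy rule are my assumptions, not the thesis's calibrated model.

```c
#include <stddef.h>
#include <stdio.h>

/* Illustrative per-access latencies in ns; not the thesis's calibrated values. */
static const double DRAM_LAT = 60.0, PCM_READ_LAT = 150.0, PCM_WRITE_LAT = 450.0;

struct obj {
    const char *name;
    long reads, writes;   /* counts from context-aware object profiling */
    size_t size;          /* bytes */
};

/* Extra total latency the object would incur if placed in PCM instead of DRAM. */
static double pcm_penalty(const struct obj *o)
{
    return o->reads * (PCM_READ_LAT - DRAM_LAT)
         + o->writes * (PCM_WRITE_LAT - DRAM_LAT);
}

int main(void)
{
    /* Assume objs[] is pre-sorted by pcm_penalty()/size, descending. */
    struct obj objs[] = {
        { "hot_index", 900000, 400000,  1u << 20 },  /* hot, write-heavy */
        { "cold_blob",   5000,    100, 64u << 20 },  /* big and cold     */
    };
    size_t dram_left = 32u << 20;   /* 32 MiB of DRAM to hand out */

    for (size_t i = 0; i < sizeof objs / sizeof objs[0]; i++) {
        if (objs[i].size <= dram_left && pcm_penalty(&objs[i]) > 0) {
            dram_left -= objs[i].size;
            printf("%s -> DRAM\n", objs[i].name);
        } else {
            printf("%s -> PCM\n", objs[i].name);
        }
    }
    return 0;
}
```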

    Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems

    Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving, and diverse workloads lead to a need for configurable memory resources to achieve high performance and utilization. In this study, we evaluate a memory subsystem design leveraging CXL-enabled memory pooling. Two promising use cases of composable memory subsystems are studied: fine-grained capacity provisioning and scalable bandwidth provisioning. We developed an emulator to explore the performance impact of various memory compositions, and we provide a profiler to identify memory usage patterns in applications and their optimization opportunities. Seven scientific and six graph applications are evaluated on various emulated memory configurations. Three of the seven scientific applications had less than 10% performance impact when the pooled memory backed 75% of their memory footprint. The results also show that a dynamically configured high-bandwidth system can effectively support bandwidth-intensive unstructured mesh-based applications like OpenFOAM. Finally, we identify interference through shared memory pools as a practical challenge for adoption on HPC systems.
    Comment: 10 pages, 13 figures. Accepted for publication in the Workshop on Memory Centric High Performance Computing (MCHPC'22) at SC22.
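    On current hardware, CXL-attached pool memory is commonly emulated by backing part of the working set with a remote NUMA node. The sketch below uses libnuma to set up a configuration like the paper's 75%-pooled case; the emulation approach and the node choice are my assumptions, not the authors' emulator. Link with -lnuma.

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    size_t total  = (size_t)1 << 30;     /* 1 GiB working set (illustrative) */
    size_t pooled = total / 4 * 3;       /* 75% backed by the "pool" */
    int near = 0;                        /* local DRAM node */
    int far  = numa_max_node();          /* treat the last node as CXL-like */

    char *fast = numa_alloc_onnode(total - pooled, near);
    char *pool = numa_alloc_onnode(pooled, far);
    if (!fast || !pool) { fprintf(stderr, "alloc failed\n"); return 1; }

    memset(fast, 0, total - pooled);     /* touch pages so they are placed */
    memset(pool, 0, pooled);

    /* ... run and time the workload over fast[] and pool[] here ... */

    numa_free(fast, total - pooled);
    numa_free(pool, pooled);
    return 0;
}
```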

    Distributed shared memory on heterogeneous CPUs+GPUs platforms

    Master's dissertation in Informatics Engineering (Engenharia Informática)
    Developing for heterogeneous platforms can significantly complicate the coding process, since different processing devices mean different architectures, programming and memory models, disjoint address spaces, and so on. This document proposes that the development process can be eased by virtualizing a traditional shared memory environment on top of the heterogeneous distributed system and exposing a unified memory model to the developer. The memory system frees the developer from having to manually manage data movement and allows the use of dynamic memory accessible by all devices. The proposed memory system was implemented and validated in the GAMA framework using three algorithms to benchmark the system: SAXPY, all-pairs N-Body simulation, and Barnes-Hut N-Body simulation. These algorithms were used to evaluate the framework's performance and scalability when paired with the proposed memory system. The results show that, overall, the memory system improved performance on all algorithms. It proved most useful on algorithms with a high ratio of computation to memory accesses, by improving execution times, and especially useful on irregular algorithms, where it also improved scalability. The parallel memory allocator performed very well when used only on the CPU, but allocation speed suffered when GPUs were added to the system.
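    A unified memory model like the one described implies per-buffer bookkeeping of which copy, host or device, is current. Below is a minimal acquire-based coherence sketch of my own, with device transfers stubbed out; it is not GAMA's actual implementation.

```c
#include <stddef.h>

/* Which copies of a shared buffer are up to date. */
enum validity { HOST_ONLY, DEVICE_ONLY, SHARED };

struct dsm_buf {
    void *host_ptr;
    void *dev_ptr;
    size_t len;
    enum validity valid;
};

/* Transfer stubs; a real system would issue e.g. OpenCL/CUDA copies here. */
static void copy_to_device(struct dsm_buf *b) { (void)b; }
static void copy_to_host(struct dsm_buf *b)   { (void)b; }

/* Device wants to read: refresh its copy if the host holds the latest data. */
void *dev_acquire_ro(struct dsm_buf *b)
{
    if (b->valid == HOST_ONLY) { copy_to_device(b); b->valid = SHARED; }
    return b->dev_ptr;
}

/* Device wants to write: afterwards the device copy is the only valid one. */
void *dev_acquire_rw(struct dsm_buf *b)
{
    if (b->valid == HOST_ONLY) copy_to_device(b);
    b->valid = DEVICE_ONLY;
    return b->dev_ptr;
}

/* Host wants to read: pull data back if the device copy is the freshest. */
void *host_acquire_ro(struct dsm_buf *b)
{
    if (b->valid == DEVICE_ONLY) { copy_to_host(b); b->valid = SHARED; }
    return b->host_ptr;
}
```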

    Architecting heterogeneous memory systems with 3D die-stacked memory

    The main objective of this research is to efficiently enable 3D die-stacked memory and heterogeneous memory systems. 3D die-stacking is an emerging technology that allows large amounts of in-package, high-bandwidth memory storage. Die-stacked memory has the potential to provide extraordinary performance and energy benefits for computing environments ranging from data-intensive to mobile computing. However, incorporating die-stacked memory into computing environments requires innovations across the system stack, from hardware to software. This dissertation presents several architectural innovations to practically deploy die-stacked memory in a variety of computing systems.

    First, this dissertation proposes using die-stacked DRAM as a hardware-managed cache in a practical and efficient way. The proposed DRAM cache architecture employs two novel techniques: hit-miss speculation and self-balancing dispatch. These techniques virtually eliminate the hardware overhead of maintaining a multi-megabyte SRAM structure when scaling to gigabytes of stacked DRAM cache, and they improve overall memory bandwidth utilization.

    Second, this dissertation proposes a DRAM cache organization that provides a high level of reliability for die-stacked DRAM caches in a cost-effective manner. The proposed DRAM cache uses error-correcting codes (ECCs), strong checksums (CRCs), and dirty-data duplication to detect and correct a wide range of stacked-DRAM failures, from traditional bit errors to large-scale row, column, bank, and channel failures, within the constraints of commodity, non-ECC DRAM stacks. With only a modest performance degradation compared to a DRAM cache with no ECC support, the proposed organization can correct all single-bit failures and 99.9993% of all row, column, and bank failures.

    Third, this dissertation proposes architectural mechanisms to use large, fast, on-chip memory structures as part of memory (PoM) seamlessly through the hardware. The proposed design achieves the performance benefit of on-chip memory caches without sacrificing a large fraction of total memory capacity to serve as a cache. To achieve this, PoM implements the ability to dynamically remap regions of memory based on their access patterns and expected performance benefits.

    Lastly, this dissertation explores a new usage model for die-stacked DRAM involving a hybrid of caching and virtual memory support. In the common case, where the system's physical memory is not over-committed, the die-stacked DRAM operates as a cache to provide performance and energy benefits to the system. When the workload's active memory demands exceed the capacity of physical memory, however, the proposed scheme dynamically converts the stacked DRAM cache into a fast swap device to avoid the otherwise grievous performance penalty of swapping to disk.
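    To give a flavor of hit-miss speculation, here is a toy predictor of my own: a small table of saturating counters guesses whether a request will hit the stacked-DRAM cache, so the request can be dispatched before the in-DRAM tag check resolves. Table size, hashing, and training are illustrative, not the dissertation's design.

```c
#include <stdbool.h>
#include <stdint.h>

#define PRED_ENTRIES 4096
static uint8_t pred[PRED_ENTRIES];   /* 2-bit saturating counters, start at 0 */

/* Index by 4 KiB region of the physical address. */
static unsigned pred_idx(uint64_t paddr) { return (paddr >> 12) % PRED_ENTRIES; }

/* Speculate: a counter in the upper half means "predict hit", so the request
 * is sent to the stacked DRAM; otherwise it goes straight to off-chip memory. */
bool predict_hit(uint64_t paddr) { return pred[pred_idx(paddr)] >= 2; }

/* Train with the actual outcome once the tag check completes. */
void train_predictor(uint64_t paddr, bool was_hit)
{
    uint8_t *c = &pred[pred_idx(paddr)];
    if (was_hit)  { if (*c < 3) (*c)++; }
    else          { if (*c > 0) (*c)--; }
}
```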

    Energy-efficient architectures for chip-scale networks and memory systems using silicon-photonics technology

    Today's supercomputers and cloud systems run many data-centric applications such as machine learning, graph algorithms, and cognitive processing, which have large data footprints and complex data access patterns. With the energy efficiency of large-scale systems projected to rise to 50 GFLOPS/W, the target energy-per-bit budget for data movement is expected to reach as low as 0.1 pJ/bit, assuming 200 bits/FLOP for data transfers. This tight energy budget impacts the design of both chip-scale networks and main memory systems. Conventional electrical links used in chip-scale networks (0.5-3 pJ/bit) and the DRAM systems used in main memory (>30 pJ/bit) fail to provide sustained performance at such low energy budgets. This thesis builds on promising research on silicon-photonic technology to design system architectures and system management policies for chip-scale networks and main memory systems.

    The adoption of silicon-photonic links as chip-scale networks, however, is hampered by the high sensitivity of optical devices to thermal and process variations. These device sensitivities result in high power overheads at high communication speeds. Moreover, applications differ in their resource utilization, resulting in application-specific thermal profiles and bandwidth needs. Similarly, optically-controlled memory systems designed using conventional electrically-based architectures require additional circuitry for electrical-to-optical and optical-to-electrical conversions within the memory, which increases the energy and latency of each memory access. Due to these issues, chip-scale networks and memory systems designed using silicon-photonic technology leave much of the technology's benefit unrealized.

    This thesis argues for the need to rearchitect memory systems and redesign network management policies so that they are aware of application variability and of the underlying device characteristics of silicon-photonic technology. We claim that such a cross-layer design enables a high-throughput, energy-efficient unified silicon-photonic link and main memory system. The thesis undertakes this cross-layer design on two fronts. First, we study the varying network bandwidth requirements across different applications and within a given application. To address this variability, we develop bandwidth allocation policies that account for application needs and device sensitivities to ensure power-efficient operation of silicon-photonic links. Second, we design a novel architecture for an optically-controlled main memory system that is directly interfaced with silicon-photonic links using a novel read and write access protocol. Such a system ensures low-energy, high-throughput access from the processor to a high-density memory. To further address the diversity in application memory characteristics, we explore heterogeneous memory systems with multiple memory modules that provide varied power-performance benefits, and we design a memory management policy for such systems that allocates pages at the granularity of memory objects within an application.
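    As one concrete reading of the bandwidth-allocation idea, the sketch below sizes the number of active wavelengths per epoch from observed demand, under a link power cap. The 0.1 pJ/bit figure comes from the abstract; the per-wavelength rate, the power cap, and the policy itself are illustrative assumptions. Link with -lm.

```c
#include <math.h>
#include <stdio.h>

#define MAX_LAMBDAS 64
static const double GBPS_PER_LAMBDA = 10.0; /* illustrative per-wavelength rate */
static const double PJ_PER_BIT = 0.1;       /* target budget quoted in the abstract */

/* Wavelengths to enable next epoch, given observed demand and a power cap.
 * Unit check: (pJ/bit) * (Gbit/s) = 1e9 pJ/s = 1 mW, so one 10 Gb/s
 * wavelength at 0.1 pJ/bit costs about 1 mW. */
int lambdas_for_epoch(double demand_gbps, double power_budget_mw)
{
    int need = (int)ceil(demand_gbps / GBPS_PER_LAMBDA);
    int affordable = (int)(power_budget_mw / (PJ_PER_BIT * GBPS_PER_LAMBDA));
    if (need > affordable) need = affordable;
    if (need > MAX_LAMBDAS) need = MAX_LAMBDAS;
    return need > 0 ? need : 0;
}

int main(void)
{
    /* e.g. 85 Gb/s of demand under a 12 mW cap -> 9 wavelengths */
    printf("%d wavelengths\n", lambdas_for_epoch(85.0, 12.0));
    return 0;
}
```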