1,074 research outputs found

    Adjacent LSTM-Based Page Scheduling for Hybrid DRAM/NVM Memory Systems

    Get PDF
    Recent advances in memory technologies have led to the rapid growth of hybrid systems that combine traditional DRAM and Non-Volatile Memory (NVM), as the latter provides lower cost per byte, lower leakage power, and larger capacity than DRAM while guaranteeing comparable access latency. Such heterogeneous memory systems impose new challenges in page placement and migration among the alternative technologies of the memory system. In this paper, we present a novel approach for efficient page placement on heterogeneous DRAM/NVM systems. We design an adjacent LSTM-based approach for page placement that relies strongly on page access prediction while sharing knowledge among pages with similar behavior. The proposed approach improves performance by up to 65.5% compared to existing approaches, achieves near-optimal results, and saves 20.2% energy consumption on average. Moreover, we propose a new page replacement policy, named clustered-LRU, that improves performance by up to 8.1% compared to the default Least Recently Used (LRU) policy.
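The abstract does not define clustered-LRU precisely; the sketch below is one plausible reading under our own assumptions, where pages are grouped into behavior clusters and eviction takes the LRU page of the least-recently-touched cluster. The class and method names are ours, not the paper's.

```python
from collections import OrderedDict


class ClusteredLRU:
    """Hypothetical clustered-LRU cache: evict from the least-recently-used
    cluster first, then the LRU page inside that cluster."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clusters = OrderedDict()  # cluster_id -> OrderedDict of pages
        self.size = 0

    def access(self, page, cluster_id):
        pages = self.clusters.setdefault(cluster_id, OrderedDict())
        if page in pages:
            pages.move_to_end(page)        # refresh recency within cluster
        else:
            pages[page] = True
            self.size += 1
            if self.size > self.capacity:
                self._evict()
        self.clusters.move_to_end(cluster_id)  # cluster is now most recent

    def _evict(self):
        # the coldest cluster is the first key; drop its LRU page
        cold_id, cold_pages = next(iter(self.clusters.items()))
        victim, _ = cold_pages.popitem(last=False)
        if not cold_pages:
            del self.clusters[cold_id]
        self.size -= 1
        return victim
```

Compared with a flat LRU, a burst of accesses to one cluster cannot push out the working set of another cluster until the first cluster as a whole goes cold.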

    Improving Performance and Flexibility of Fabric-Attached Memory Systems

    Get PDF
    As demands for memory-intensive applications continue to grow, the memory capacity of each computing node is expected to grow at a similar pace. In high-performance computing (HPC) systems, the memory capacity per compute node is sized for the most demanding application likely to run on the system, so the average capacity per node in future HPC systems is expected to grow significantly. However, diverse applications with different memory requirements run on HPC systems, and memory utilization can fluctuate widely from one application to another. Since memory modules are private to their computing node, a large fraction of the overall memory capacity will likely be underutilized, especially when many jobs have small memory footprints. Thus, as HPC systems move toward the exascale era, better utilization of memory is strongly desired. Moreover, as new memory technologies come on the market, the flexibility of upgrading memory and the system becomes a major concern, since memory modules are tightly coupled with the computing nodes. To address these issues, vendors are exploring fabric-attached memory (FAM) systems, in which resources are decoupled and maintained independently. This design has driven technology providers to develop new protocols, such as cache-coherent interconnects and memory-semantic fabrics, to connect discrete resources and help users leverage advances in memory technologies to satisfy growing memory and storage demands. Using these protocols, FAM can be attached directly to a system interconnect and integrated easily with a variety of processing elements (PEs). Moreover, systems that support FAM can be upgraded smoothly and allow multiple PEs to share the FAM memory pools through well-defined protocols.
Sharing FAM between PEs enables efficient data sharing, improves memory utilization, reduces cost by allowing flexible integration of PEs and memory modules from several vendors, and makes the system easier to upgrade. However, adopting FAM in HPC systems brings new challenges. Since memory is disaggregated and accessed over fabric networks, memory access latency is a crucial concern. In addition, quality of service, security against neighbor nodes, coherency, and the address-translation overhead of accessing FAM are problems that require rethinking for FAM systems. To this end, we study and discuss the various challenges that FAM systems must address. First, we developed a simulation environment to mimic and analyze FAM systems. We then present our work on addressing these challenges to improve the performance and feasibility of such systems: enforcing quality of service, providing page-migration support, and enhancing security against malicious neighbor nodes.
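The thesis does not spell out its page-migration mechanism here; as an illustration of the kind of policy such support implies, the sketch below promotes recently hot FAM-resident pages into a node's local memory and demotes the coldest local pages when capacity runs out. All names, the threshold, and the policy itself are our assumptions.

```python
def plan_migrations(access_counts, local_pages, local_capacity, threshold=8):
    """Illustrative FAM promotion policy (not the thesis's mechanism):
    promote remote pages accessed at least `threshold` times this epoch,
    demoting colder local pages when local memory is full."""
    remote_hot = sorted(
        (p for p in access_counts
         if p not in local_pages and access_counts[p] >= threshold),
        key=lambda p: access_counts[p], reverse=True)
    # local pages ordered coldest-first, so demotion victims come cheap
    local = sorted(local_pages, key=lambda p: access_counts.get(p, 0))
    free = local_capacity - len(local_pages)
    promote, demote = [], []
    for page in remote_hot:
        if free > 0:
            free -= 1
        elif local and access_counts.get(local[0], 0) < access_counts[page]:
            demote.append(local.pop(0))   # swap out the coldest local page
        else:
            break                          # no profitable migration left
        promote.append(page)
    return promote, demote
```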

    RapidSwap: An Efficient Hierarchical Far Memory

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021.8. Bernhard Egger.As computation responsibilities are transferred and migrated to cloud computing environments, cloud operators are facing more challenges to accommodate workloads provided by their customers. Modern applications typically require a massive amount of main memory. DRAM allows the robust delivery of data to processing entities in conventional node-centric architectures. However, physically expanding DRAM is impracticable due to hardware limits and cost. In this thesis, we present RapidSwap, an efficient hierarchical far memory that exploits phase-change memory (persistent memory) in data centers to present near-DRAM performance at a significantly lower total cost of ownership (TCO). RapidSwap migrates cold memory contents to slower and cheaper storage devices by exhibiting the memory access frequency of applications. Evaluated with several different real-world cloud benchmark scenarios, RapidSwap achieves a reduction of 20% in operating cost at minimal performance degradation and is 30% more cost-effective than pure DRAM solutions. RapidSwap exemplifies that sophisticated utilization of novel storage technologies can present significant TCO savings in cloud data centers.์ปดํ“จํŒ… ํ™˜๊ฒฝ์ด ํด๋ผ์šฐ๋“œ ํ™˜๊ฒฝ์„ ์ค‘์‹ฌ์œผ๋กœ ๋ณ€ํ™”ํ•˜๊ณ  ์žˆ์–ด ํด๋ผ์šฐ๋“œ ์ œ๊ณต์ž๋Š” ๊ณ ๊ฐ์ด ์ œ๊ณตํ•˜๋Š” ์›Œํฌ๋กœ๋“œ๋ฅผ ์ˆ˜์šฉํ•˜๊ธฐ ์œ„ํ•œ ๋‹ค์–‘ํ•œ ๋ฌธ์ œ์— ์ง๋ฉดํ•˜๊ณ  ์žˆ๋‹ค. ์˜ค๋Š˜๋‚  ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ๋งŽ์€ ์–‘์˜ ๋ฉ”์ธ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์š”๊ตฌํ•œ๋‹ค. ๊ธฐ์กด ๋…ธ๋“œ ์ค‘์‹ฌ ์•„ํ‚คํ…์ฒ˜์—์„œ DRAM์„ ์‚ฌ์šฉํ•˜๋ฉด ๋น ๋ฅด๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ๋ฌผ๋ฆฌ์ ์œผ๋กœ DRAM์„ ์ผ์ • ์ˆ˜์ค€ ์ด์ƒ ํ™•์žฅํ•˜๋Š” ๊ฒƒ์€ ํ•˜๋“œ์›จ์–ด ์ œํ•œ๊ณผ ๋น„์šฉ์œผ๋กœ ์ธํ•ด ํ˜„์‹ค์ ์œผ๋กœ ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค. 
๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” DRAM์— ๊ฐ€๊นŒ์šด ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•˜๋ฉด์„œ๋„ ์ด ์†Œ์œ  ๋น„์šฉ์„ ์ƒ๋‹นํžˆ ๋‚ฎ์ถ”๋Š” ํšจ์œจ์  far memory์ธ RapidSwap์„ ์ œ์‹œํ•˜์˜€๋‹ค. RapidSwap์€ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ํ™˜๊ฒฝ์—์„œ ์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ (phase-change memory; persistent memory)๋ฅผ ํ™œ์šฉํ•˜๋ฉฐ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋นˆ๋„๋ฅผ ์ถ”์ ํ•˜์—ฌ ์ž์ฃผ ์ ‘๊ทผ๋˜์ง€ ์•Š๋Š” ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Š๋ฆฌ๊ณ  ์ €๋ ดํ•œ ์ €์žฅ์žฅ์น˜๋กœ ์ด์†กํ•˜์—ฌ ์ด๋ฅผ ๋‹ฌ์„ฑํ•œ๋‹ค. ์—ฌ๋Ÿฌ ์ €๋ช…ํ•œ ํด๋ผ์šฐ๋“œ ๋ฒค์น˜๋งˆํฌ ์‹œ๋‚˜๋ฆฌ์˜ค๋กœ ํ‰๊ฐ€ํ•œ ๊ฒฐ๊ณผ, RapidSwap์€ ์ˆœ์ˆ˜ DRAM ๋Œ€๋น„ ์•ฝ 20%์˜ ์šด์˜ ๋น„์šฉ์„ ์ ˆ๊ฐํ•˜๋ฉฐ ์•ฝ 30%์˜ ๋น„์šฉ ํšจ์œจ์„ฑ์„ ์ง€๋‹Œ๋‹ค. RapidSwap์€ ์ƒˆ๋กœ์šด ์Šคํ† ๋ฆฌ์ง€ ๊ธฐ์ˆ ์„ ์ •๊ตํ•˜๊ฒŒ ํ™œ์šฉํ•˜๋ฉด ํด๋ผ์šฐ๋“œ ๋ฐ์ดํ„ฐ ์„ผํ„ฐ ํ™˜๊ฒฝ์—์„œ ์šด์˜๋น„์šฉ์„ ์ƒ๋‹นํžˆ ์ €๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๋ณด์ธ๋‹ค.Chapter 1 Introduction 1 Chapter 2 Background 4 2.1 Tiered Storage 4 2.2 Trends in Storage Devices 5 2.3 Techniques Proposed to Lower Memory Pressure 5 2.3.1 Transparent Memory Compression 5 2.3.2 Far Memory 6 Chapter 3 Motivation 9 3.1 Limitations of Existing Techniques 9 3.2 Tiered Storage as a Promising Alternative 10 Chapter 4 RapidSwap Design and Implementation 12 4.1 RapidSwap Design 12 4.1.1 Storage Frontend 12 4.1.2 Storage Backend 15 4.2 RapidSwap Implementation 17 4.2.1 Swap Handler 17 4.2.2 Storage Frontend 18 4.2.3 Storage Backend 20 Chapter 5 Results 21 5.1 Experimental Setup 21 5.2 RapidSwap Performance 23 5.2.1 Degradation over DRAM 23 5.2.2 Tiered Storage Utilization 27 5.2.3 Hit/Miss Analysis 28 5.3 Cost of Storage Tier 29 5.4 Cost Effectiveness 30 Chapter 6 Conclusion and Future Work 32 6.1 Conclusion 32 6.2 Future Work 33 Bibliography 34 ์š”์•ฝ 39์„

    Design and Evaluation of a Rack-Scale Disaggregated Memory Architecture For Data Centers

    Full text link
    Memory disaggregation is being considered as a strong alternative to the traditional architecture for dealing with memory under-utilization in data centers. Disaggregated memory can adapt to the dynamically changing memory requirements of data-center applications, such as data analytics and big data, that require in-memory processing. However, such systems can face high remote-memory access latency due to interconnect speeds. In this paper, we explore a rack-scale disaggregated memory architecture and discuss its various design aspects. We design a trace-driven simulator that combines an event-based interconnect simulator with a cycle-accurate memory simulator to evaluate the performance of a disaggregated memory system at rack scale. Our study shows that not only the interconnect but also contention in the remote memory queues adds significantly to remote-memory access latency. We introduce a memory allocation policy that reduces this latency compared to conventional policies. We conduct experiments using various benchmarks with diverse memory access patterns, and the results are encouraging for rack-scale memory disaggregation, showing acceptable average memory access latency.
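The abstract does not describe the proposed allocation policy in detail; a minimal contention-aware sketch, under our own assumptions, would steer new pages toward the remote module with the shortest pending-request queue rather than allocating round-robin:

```python
def allocate(page, queue_depths, capacities, placement):
    """Hypothetical contention-aware allocation: place a new page on the
    remote memory module with the shortest request queue that still has
    free capacity, so queuing delay (not just link latency) is minimized."""
    candidates = [m for m in queue_depths if capacities[m] > 0]
    if not candidates:
        raise MemoryError("no remote capacity left")
    target = min(candidates, key=lambda m: queue_depths[m])
    placement[page] = target
    capacities[target] -= 1
    return target
```

The point the paper makes is that queue contention at the remote modules is a first-class latency component, so a policy that only balances capacity can still perform poorly.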

    Assise: Performance and Availability via NVM Colocation in a Distributed File System

    Full text link
    The adoption of very low latency persistent memory modules (PMMs) upends the long-established model of disaggregated file system access. Instead, by colocating computation and PMM storage, we can provide applications much higher I/O performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol that manages a set of server-colocated PMMs as a fast, crash-recoverable cache between applications and slower disaggregated storage, such as SSDs. Unlike disaggregated file systems, Assise maximizes locality for all file I/O by carrying out I/O on colocated PMMs whenever possible, and minimizes coherence overhead by maintaining consistency at I/O-operation granularity rather than at fixed block sizes. We compare Assise to Ceph/BlueStore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs, using common cloud applications and benchmarks such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency by up to 22x, throughput by up to 56x, and fail-over time by up to 103x, and scales up to 6x better than its counterparts, while providing stronger consistency semantics. Assise promises to beat the MinuteSort world record by 1.5x.

    Adding Machine Intelligence to Hybrid Memory Management

    Get PDF
    Computing platforms increasingly incorporate heterogeneous memory hardware technologies as a way to scale application performance and memory capacity and to achieve cost effectiveness. However, this heterogeneity, along with the greater irregularity in the behavior of emerging workloads, renders existing hybrid memory management approaches ineffective, calling for more intelligent methods. To this end, this thesis reveals new insights, develops novel methods, and contributes system-level mechanisms toward the practical integration of machine learning into hybrid memory management, boosting application performance and system resource efficiency. First, this thesis builds Kleio, a hybrid memory page scheduler with machine intelligence. Kleio deploys recurrent neural networks to learn memory access patterns at page granularity and to improve the selection of dynamic page migrations across the memory hardware components. Kleio focuses the machine learning on the page subset whose timely movement will yield the most application performance improvement, while preserving lightweight history-based management for the rest of the pages. In this way, Kleio bridges on average 80% of the existing relative performance gap, while laying the groundwork for practical machine-intelligent data management with manageable learning overheads. In addition, this thesis contributes three system-level mechanisms to further boost application performance and reduce the operational and learning overheads of machine learning-based hybrid memory management. First, this thesis builds Cori, a system-level solution for tuning the operational frequency of periodic page schedulers for hybrid memories. Cori leverages insights on data reuse times to fine-tune the page migration frequency in a lightweight manner. Second, this thesis contributes Coeus, a page grouping mechanism for page schedulers like Kleio.
Coeus leverages Cori's data reuse insights to tune the granularity at which patterns are interpreted by the page scheduler and to enable the training of a single recurrent neural network per page cluster, reducing model training times by 3x. The combined effects of Cori and Coeus provide 3x additional performance improvement to Kleio. Finally, this thesis proposes Cronus, an image-based page selector for page schedulers like Kleio. Cronus uses visualization to accelerate the selection of which page patterns should be managed with machine learning, reducing Kleio's operational overheads by 75x. Cronus lays the foundations for the future use of visualization and computer vision methods in memory management, such as image-based memory access pattern classification, recognition, and prediction.
    Ph.D.
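Kleio's key design point, splitting pages between a learned predictor and lightweight history-based management, can be sketched as a ranking step. The benefit metric below (access count weighted by a misplacement penalty) is an illustrative stand-in for Kleio's actual ranking, and all names are ours.

```python
def select_ml_pages(access_counts, misplacement_penalty, k):
    """Hand the k pages with the highest expected benefit to the learned
    (RNN-based) predictor; everything else keeps cheap history-based
    management. Benefit here = accesses x assumed misplacement penalty."""
    benefit = {p: access_counts[p] * misplacement_penalty.get(p, 1)
               for p in access_counts}
    ranked = sorted(benefit, key=benefit.get, reverse=True)
    return set(ranked[:k]), set(ranked[k:])
```

This captures why the approach keeps learning overheads manageable: only the small subset of pages whose placement actually matters pays the cost of a neural model.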

    Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems

    Full text link
    Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving: diverse workloads create a need for configurable memory resources to achieve high performance and utilization. In this study, we evaluate a memory subsystem design leveraging CXL-enabled memory pooling. Two promising use cases of composable memory subsystems are studied: fine-grained capacity provisioning and scalable bandwidth provisioning. We developed an emulator to explore the performance impact of various memory compositions, and we provide a profiler to identify applications' memory usage patterns and their optimization opportunities. Seven scientific and six graph applications were evaluated on various emulated memory configurations. Three of the seven scientific applications saw less than 10% performance impact when pooled memory backed 75% of their memory footprint. The results also show that a dynamically configured high-bandwidth system can effectively support bandwidth-intensive unstructured-mesh applications like OpenFOAM. Finally, we identify interference through shared memory pools as a practical challenge for adoption on HPC systems.
    Comment: 10 pages, 13 figures. Accepted for publication in the Workshop on Memory Centric High Performance Computing (MCHPC'22) at SC2
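A first-order intuition for the capacity-provisioning result is a simple blended-latency model: if accesses hit local DDR and the CXL pool in proportion to how much of the footprint each backs (a naive uniform-access assumption), average latency interpolates between the two. The latency numbers below are assumptions for illustration, not measurements from the paper.

```python
def avg_access_latency(local_ns, pool_ns, pooled_fraction):
    """Naive average memory access latency when `pooled_fraction` of the
    footprint lives in the CXL pool and accesses are uniform over pages."""
    return (1 - pooled_fraction) * local_ns + pooled_fraction * pool_ns


# e.g. 75% of the footprint on a pool with assumed 2.5x local latency
slowdown = avg_access_latency(100, 250, 0.75) / 100
```

Real applications do much better than this bound when their hot pages stay local, which is why a profiler that exposes per-region access patterns matters for choosing what to place in the pool.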