1,074 research outputs found

    Adjacent LSTM-Based Page Scheduling for Hybrid DRAM/NVM Memory Systems

    Get PDF
    Recent advances in memory technologies have led to the rapid growth of hybrid systems that combine traditional DRAM and Non-Volatile Memory (NVM), as the latter provides lower cost per byte, lower leakage power, and larger capacity than DRAM while guaranteeing comparable access latency. Such heterogeneous memory systems impose new challenges in page placement and migration among the alternative technologies of the memory system. In this paper, we present a novel approach for efficient page placement on heterogeneous DRAM/NVM systems. We design an adjacent LSTM-based approach for page placement that relies strongly on page access prediction while sharing knowledge among pages with similar behavior. The proposed approach improves performance by up to 65.5% compared to existing approaches, achieves near-optimal results, and saves 20.2% energy consumption on average. Moreover, we propose a new page replacement policy, named clustered-LRU, that improves performance by up to 8.1% compared to the default Least Recently Used (LRU) policy.
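The abstract does not define clustered-LRU precisely; the sketch below is one plausible reading under our own assumptions, where pages are grouped into behavior clusters and eviction takes the LRU page of the least-recently-touched cluster. The class and method names are ours, not the paper's.

```python
from collections import OrderedDict


class ClusteredLRU:
    """Hypothetical clustered-LRU cache: evict from the least-recently-used
    cluster first, then the LRU page inside that cluster."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.clusters = OrderedDict()  # cluster_id -> OrderedDict of pages
        self.size = 0

    def access(self, page, cluster_id):
        pages = self.clusters.setdefault(cluster_id, OrderedDict())
        if page in pages:
            pages.move_to_end(page)        # refresh recency within cluster
        else:
            pages[page] = True
            self.size += 1
            if self.size > self.capacity:
                self._evict()
        self.clusters.move_to_end(cluster_id)  # cluster is now most recent

    def _evict(self):
        # the coldest cluster is the first key; drop its LRU page
        cold_id, cold_pages = next(iter(self.clusters.items()))
        victim, _ = cold_pages.popitem(last=False)
        if not cold_pages:
            del self.clusters[cold_id]
        self.size -= 1
        return victim
```

Compared with a flat LRU, a burst of accesses to one cluster cannot push out the working set of another cluster until the first cluster as a whole goes cold.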

    Improving Performance and Flexibility of Fabric-Attached Memory Systems

    Get PDF
    As demands for memory-intensive applications continue to grow, the memory capacity of each computing node is expected to grow at a similar pace. In high-performance computing (HPC) systems, the memory capacity per compute node is sized for the most demanding application likely to run on the system, so the average capacity per node in future HPC systems is expected to grow significantly. However, diverse applications with different memory requirements run on HPC systems, and memory utilization can fluctuate widely from one application to another. Since memory modules are private to their computing node, a large fraction of the overall memory capacity will likely be underutilized, especially when many jobs have small memory footprints. Thus, as HPC systems move toward the exascale era, better utilization of memory is strongly desired. Moreover, as new memory technologies come on the market, the flexibility of upgrading memory and the system becomes a major concern, since memory modules are tightly coupled with the computing nodes. To address these issues, vendors are exploring fabric-attached memory (FAM) systems, in which resources are decoupled and maintained independently. This design has driven technology providers to develop new protocols, such as cache-coherent interconnects and memory-semantic fabrics, to connect discrete resources and help users leverage advances in memory technologies to satisfy growing memory and storage demands. Using these protocols, FAM can be attached directly to a system interconnect and integrated easily with a variety of processing elements (PEs). Moreover, systems that support FAM can be upgraded smoothly and allow multiple PEs to share the FAM memory pools through well-defined protocols.
Sharing FAM between PEs enables efficient data sharing, improves memory utilization, reduces cost by allowing flexible integration of PEs and memory modules from several vendors, and makes the system easier to upgrade. However, adopting FAM in HPC systems brings new challenges. Since memory is disaggregated and accessed over fabric networks, memory access latency is a crucial concern. In addition, quality of service, security against neighbor nodes, coherency, and the address-translation overhead of accessing FAM are problems that require rethinking for FAM systems. To this end, we study and discuss the various challenges that FAM systems must address. First, we developed a simulation environment to mimic and analyze FAM systems. We then present our work on addressing these challenges to improve the performance and feasibility of such systems: enforcing quality of service, providing page-migration support, and enhancing security against malicious neighbor nodes.
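The thesis does not spell out its page-migration mechanism here; as an illustration of the kind of policy such support implies, the sketch below promotes recently hot FAM-resident pages into a node's local memory and demotes the coldest local pages when capacity runs out. All names, the threshold, and the policy itself are our assumptions.

```python
def plan_migrations(access_counts, local_pages, local_capacity, threshold=8):
    """Illustrative FAM promotion policy (not the thesis's mechanism):
    promote remote pages accessed at least `threshold` times this epoch,
    demoting colder local pages when local memory is full."""
    remote_hot = sorted(
        (p for p in access_counts
         if p not in local_pages and access_counts[p] >= threshold),
        key=lambda p: access_counts[p], reverse=True)
    # local pages ordered coldest-first, so demotion victims come cheap
    local = sorted(local_pages, key=lambda p: access_counts.get(p, 0))
    free = local_capacity - len(local_pages)
    promote, demote = [], []
    for page in remote_hot:
        if free > 0:
            free -= 1
        elif local and access_counts.get(local[0], 0) < access_counts[page]:
            demote.append(local.pop(0))   # swap out the coldest local page
        else:
            break                          # no profitable migration left
        promote.append(page)
    return promote, demote
```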

    RapidSwap: An Efficient Hierarchical Far Memory

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021.8. Bernhard Egger.As computation responsibilities are transferred and migrated to cloud computing environments, cloud operators are facing more challenges to accommodate workloads provided by their customers. Modern applications typically require a massive amount of main memory. DRAM allows the robust delivery of data to processing entities in conventional node-centric architectures. However, physically expanding DRAM is impracticable due to hardware limits and cost. In this thesis, we present RapidSwap, an efficient hierarchical far memory that exploits phase-change memory (persistent memory) in data centers to present near-DRAM performance at a significantly lower total cost of ownership (TCO). RapidSwap migrates cold memory contents to slower and cheaper storage devices by exhibiting the memory access frequency of applications. Evaluated with several different real-world cloud benchmark scenarios, RapidSwap achieves a reduction of 20% in operating cost at minimal performance degradation and is 30% more cost-effective than pure DRAM solutions. RapidSwap exemplifies that sophisticated utilization of novel storage technologies can present significant TCO savings in cloud data centers.์ปดํ“จํŒ… ํ™˜๊ฒฝ์ด ํด๋ผ์šฐ๋“œ ํ™˜๊ฒฝ์„ ์ค‘์‹ฌ์œผ๋กœ ๋ณ€ํ™”ํ•˜๊ณ  ์žˆ์–ด ํด๋ผ์šฐ๋“œ ์ œ๊ณต์ž๋Š” ๊ณ ๊ฐ์ด ์ œ๊ณตํ•˜๋Š” ์›Œํฌ๋กœ๋“œ๋ฅผ ์ˆ˜์šฉํ•˜๊ธฐ ์œ„ํ•œ ๋‹ค์–‘ํ•œ ๋ฌธ์ œ์— ์ง๋ฉดํ•˜๊ณ  ์žˆ๋‹ค. ์˜ค๋Š˜๋‚  ์‘์šฉ ํ”„๋กœ๊ทธ๋žจ์€ ์ผ๋ฐ˜์ ์œผ๋กœ ๋งŽ์€ ์–‘์˜ ๋ฉ”์ธ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์š”๊ตฌํ•œ๋‹ค. ๊ธฐ์กด ๋…ธ๋“œ ์ค‘์‹ฌ ์•„ํ‚คํ…์ฒ˜์—์„œ DRAM์„ ์‚ฌ์šฉํ•˜๋ฉด ๋น ๋ฅด๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜, ๋ฌผ๋ฆฌ์ ์œผ๋กœ DRAM์„ ์ผ์ • ์ˆ˜์ค€ ์ด์ƒ ํ™•์žฅํ•˜๋Š” ๊ฒƒ์€ ํ•˜๋“œ์›จ์–ด ์ œํ•œ๊ณผ ๋น„์šฉ์œผ๋กœ ์ธํ•ด ํ˜„์‹ค์ ์œผ๋กœ ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค. 
๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” DRAM์— ๊ฐ€๊นŒ์šด ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•˜๋ฉด์„œ๋„ ์ด ์†Œ์œ  ๋น„์šฉ์„ ์ƒ๋‹นํžˆ ๋‚ฎ์ถ”๋Š” ํšจ์œจ์  far memory์ธ RapidSwap์„ ์ œ์‹œํ•˜์˜€๋‹ค. RapidSwap์€ ๋ฐ์ดํ„ฐ์„ผํ„ฐ ํ™˜๊ฒฝ์—์„œ ์ƒ๋ณ€ํ™” ๋ฉ”๋ชจ๋ฆฌ (phase-change memory; persistent memory)๋ฅผ ํ™œ์šฉํ•˜๋ฉฐ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋นˆ๋„๋ฅผ ์ถ”์ ํ•˜์—ฌ ์ž์ฃผ ์ ‘๊ทผ๋˜์ง€ ์•Š๋Š” ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Š๋ฆฌ๊ณ  ์ €๋ ดํ•œ ์ €์žฅ์žฅ์น˜๋กœ ์ด์†กํ•˜์—ฌ ์ด๋ฅผ ๋‹ฌ์„ฑํ•œ๋‹ค. ์—ฌ๋Ÿฌ ์ €๋ช…ํ•œ ํด๋ผ์šฐ๋“œ ๋ฒค์น˜๋งˆํฌ ์‹œ๋‚˜๋ฆฌ์˜ค๋กœ ํ‰๊ฐ€ํ•œ ๊ฒฐ๊ณผ, RapidSwap์€ ์ˆœ์ˆ˜ DRAM ๋Œ€๋น„ ์•ฝ 20%์˜ ์šด์˜ ๋น„์šฉ์„ ์ ˆ๊ฐํ•˜๋ฉฐ ์•ฝ 30%์˜ ๋น„์šฉ ํšจ์œจ์„ฑ์„ ์ง€๋‹Œ๋‹ค. RapidSwap์€ ์ƒˆ๋กœ์šด ์Šคํ† ๋ฆฌ์ง€ ๊ธฐ์ˆ ์„ ์ •๊ตํ•˜๊ฒŒ ํ™œ์šฉํ•˜๋ฉด ํด๋ผ์šฐ๋“œ ๋ฐ์ดํ„ฐ ์„ผํ„ฐ ํ™˜๊ฒฝ์—์„œ ์šด์˜๋น„์šฉ์„ ์ƒ๋‹นํžˆ ์ €๊ฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์‚ฌ์‹ค์„ ๋ณด์ธ๋‹ค.Chapter 1 Introduction 1 Chapter 2 Background 4 2.1 Tiered Storage 4 2.2 Trends in Storage Devices 5 2.3 Techniques Proposed to Lower Memory Pressure 5 2.3.1 Transparent Memory Compression 5 2.3.2 Far Memory 6 Chapter 3 Motivation 9 3.1 Limitations of Existing Techniques 9 3.2 Tiered Storage as a Promising Alternative 10 Chapter 4 RapidSwap Design and Implementation 12 4.1 RapidSwap Design 12 4.1.1 Storage Frontend 12 4.1.2 Storage Backend 15 4.2 RapidSwap Implementation 17 4.2.1 Swap Handler 17 4.2.2 Storage Frontend 18 4.2.3 Storage Backend 20 Chapter 5 Results 21 5.1 Experimental Setup 21 5.2 RapidSwap Performance 23 5.2.1 Degradation over DRAM 23 5.2.2 Tiered Storage Utilization 27 5.2.3 Hit/Miss Analysis 28 5.3 Cost of Storage Tier 29 5.4 Cost Effectiveness 30 Chapter 6 Conclusion and Future Work 32 6.1 Conclusion 32 6.2 Future Work 33 Bibliography 34 ์š”์•ฝ 39์„

    Design and Evaluation of a Rack-Scale Disaggregated Memory Architecture For Data Centers

    Full text link
    Memory disaggregation is being considered as a strong alternative to the traditional architecture for dealing with memory under-utilization in data centers. Disaggregated memory can adapt to the dynamically changing memory requirements of data-center applications, such as data analytics and big data, that require in-memory processing. However, such systems can face high remote-memory access latency due to interconnect speeds. In this paper, we explore a rack-scale disaggregated memory architecture and discuss its various design aspects. We design a trace-driven simulator that combines an event-based interconnect simulator with a cycle-accurate memory simulator to evaluate the performance of a disaggregated memory system at rack scale. Our study shows that not only the interconnect but also contention in the remote memory queues adds significantly to remote-memory access latency. We introduce a memory allocation policy that reduces this latency compared to conventional policies. We conduct experiments using various benchmarks with diverse memory access patterns, and the results are encouraging for rack-scale memory disaggregation, showing acceptable average memory access latency.
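The abstract does not describe the proposed allocation policy in detail; a minimal contention-aware sketch, under our own assumptions, would steer new pages toward the remote module with the shortest pending-request queue rather than allocating round-robin:

```python
def allocate(page, queue_depths, capacities, placement):
    """Hypothetical contention-aware allocation: place a new page on the
    remote memory module with the shortest request queue that still has
    free capacity, so queuing delay (not just link latency) is minimized."""
    candidates = [m for m in queue_depths if capacities[m] > 0]
    if not candidates:
        raise MemoryError("no remote capacity left")
    target = min(candidates, key=lambda m: queue_depths[m])
    placement[page] = target
    capacities[target] -= 1
    return target
```

The point the paper makes is that queue contention at the remote modules is a first-class latency component, so a policy that only balances capacity can still perform poorly.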

    Assise: Performance and Availability via NVM Colocation in a Distributed File System

    Full text link
    The adoption of very low latency persistent memory modules (PMMs) upends the long-established model of disaggregated file system access. Instead, by colocating computation and PMM storage, we can provide applications much higher I/O performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol that manages a set of server-colocated PMMs as a fast, crash-recoverable cache between applications and slower disaggregated storage, such as SSDs. Unlike disaggregated file systems, Assise maximizes locality for all file I/O by carrying out I/O on colocated PMMs whenever possible, and minimizes coherence overhead by maintaining consistency at I/O-operation granularity rather than at fixed block sizes. We compare Assise to Ceph/BlueStore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs, using common cloud applications and benchmarks such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency by up to 22x, throughput by up to 56x, and fail-over time by up to 103x, and scales up to 6x better than its counterparts, while providing stronger consistency semantics. Assise promises to beat the MinuteSort world record by 1.5x.

    Adding Machine Intelligence to Hybrid Memory Management

    Get PDF
    Computing platforms increasingly incorporate heterogeneous memory hardware technologies as a way to scale application performance and memory capacity and to achieve cost effectiveness. However, this heterogeneity, along with the greater irregularity in the behavior of emerging workloads, renders existing hybrid memory management approaches ineffective, calling for more intelligent methods. To this end, this thesis reveals new insights, develops novel methods, and contributes system-level mechanisms toward the practical integration of machine learning into hybrid memory management, boosting application performance and system resource efficiency. First, this thesis builds Kleio, a hybrid memory page scheduler with machine intelligence. Kleio deploys recurrent neural networks to learn memory access patterns at page granularity and to improve the selection of dynamic page migrations across the memory hardware components. Kleio focuses the machine learning on the page subset whose timely movement will yield the most application performance improvement, while preserving lightweight history-based management for the rest of the pages. In this way, Kleio bridges on average 80% of the existing relative performance gap, while laying the groundwork for practical machine-intelligent data management with manageable learning overheads. In addition, this thesis contributes three system-level mechanisms to further boost application performance and reduce the operational and learning overheads of machine learning-based hybrid memory management. First, this thesis builds Cori, a system-level solution for tuning the operational frequency of periodic page schedulers for hybrid memories. Cori leverages insights on data reuse times to fine-tune the page migration frequency in a lightweight manner. Second, this thesis contributes Coeus, a page grouping mechanism for page schedulers like Kleio.
Coeus leverages Cori's data reuse insights to tune the granularity at which patterns are interpreted by the page scheduler and to enable the training of a single recurrent neural network per page cluster, reducing model training times by 3x. The combined effects of Cori and Coeus provide 3x additional performance improvement to Kleio. Finally, this thesis proposes Cronus, an image-based page selector for page schedulers like Kleio. Cronus uses visualization to accelerate the selection of which page patterns should be managed with machine learning, reducing Kleio's operational overheads by 75x. Cronus lays the foundations for the future use of visualization and computer vision methods in memory management, such as image-based memory access pattern classification, recognition, and prediction.
    Ph.D.
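Kleio's key design point, splitting pages between a learned predictor and lightweight history-based management, can be sketched as a ranking step. The benefit metric below (access count weighted by a misplacement penalty) is an illustrative stand-in for Kleio's actual ranking, and all names are ours.

```python
def select_ml_pages(access_counts, misplacement_penalty, k):
    """Hand the k pages with the highest expected benefit to the learned
    (RNN-based) predictor; everything else keeps cheap history-based
    management. Benefit here = accesses x assumed misplacement penalty."""
    benefit = {p: access_counts[p] * misplacement_penalty.get(p, 1)
               for p in access_counts}
    ranked = sorted(benefit, key=benefit.get, reverse=True)
    return set(ranked[:k]), set(ranked[k:])
```

This captures why the approach keeps learning overheads manageable: only the small subset of pages whose placement actually matters pays the cost of a neural model.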

    Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems

    Full text link
    Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving: diverse workloads create a need for configurable memory resources to achieve high performance and utilization. In this study, we evaluate a memory subsystem design leveraging CXL-enabled memory pooling. Two promising use cases of composable memory subsystems are studied: fine-grained capacity provisioning and scalable bandwidth provisioning. We developed an emulator to explore the performance impact of various memory compositions, and we provide a profiler to identify applications' memory usage patterns and their optimization opportunities. Seven scientific and six graph applications were evaluated on various emulated memory configurations. Three of the seven scientific applications saw less than 10% performance impact when pooled memory backed 75% of their memory footprint. The results also show that a dynamically configured high-bandwidth system can effectively support bandwidth-intensive unstructured-mesh applications like OpenFOAM. Finally, we identify interference through shared memory pools as a practical challenge for adoption on HPC systems.
    Comment: 10 pages, 13 figures. Accepted for publication in the Workshop on Memory Centric High Performance Computing (MCHPC'22) at SC2
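A first-order intuition for the capacity-provisioning result is a simple blended-latency model: if accesses hit local DDR and the CXL pool in proportion to how much of the footprint each backs (a naive uniform-access assumption), average latency interpolates between the two. The latency numbers below are assumptions for illustration, not measurements from the paper.

```python
def avg_access_latency(local_ns, pool_ns, pooled_fraction):
    """Naive average memory access latency when `pooled_fraction` of the
    footprint lives in the CXL pool and accesses are uniform over pages."""
    return (1 - pooled_fraction) * local_ns + pooled_fraction * pool_ns


# e.g. 75% of the footprint on a pool with assumed 2.5x local latency
slowdown = avg_access_latency(100, 250, 0.75) / 100
```

Real applications do much better than this bound when their hot pages stay local, which is why a profiler that exposes per-region access patterns matters for choosing what to place in the pool.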