    Design and Implementation of Bandwidth-aware Memory Placement and Migration Policies for Heterogeneous Memory Systems

    Heterogeneous memory systems are composed of several types of memory and are used in various computing domains. Each memory node in a heterogeneous memory system has different characteristics and performance; particularly significant differences are found in access latency and memory bandwidth. The heterogeneity between memories must therefore be considered to fully exploit the performance of a heterogeneous memory system. However, most previous work did not consider the bandwidth differences between the memory nodes constituting a heterogeneous memory system. The present work proposes bandwidth-aware memory placement and migration policies to solve the problems caused by these bandwidth differences. We implement three bandwidth-aware memory placement policies and one bandwidth-aware migration policy in the Linux kernel, then quantitatively evaluate them on real systems. In addition, we show that our proposed bandwidth-aware memory placement and migration policies achieve higher performance than conventional policies that do not consider the bandwidth differences between heterogeneous memory nodes.
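
    A minimal sketch of the core idea, in Python rather than kernel code and assuming hypothetical node bandwidths: pages are placed on memory nodes with probability proportional to each node's bandwidth, so the faster node absorbs proportionally more traffic. This illustrates bandwidth-aware placement in general, not the paper's specific kernel policies.

```python
# Bandwidth-proportional page placement across memory nodes.
# Node bandwidths are hypothetical example values.
import random

NODE_BANDWIDTH_GBPS = {0: 90.0, 1: 30.0}  # e.g. fast DRAM node vs. slower node

def build_weights(bandwidths):
    """Normalize node bandwidths into placement probabilities."""
    total = sum(bandwidths.values())
    return {node: bw / total for node, bw in bandwidths.items()}

def place_page(weights, rng=random):
    """Pick a target node with probability proportional to its bandwidth."""
    r = rng.random()
    cum = 0.0
    for node, w in weights.items():
        cum += w
        if r <= cum:
            return node
    return node  # fall through on floating-point rounding

weights = build_weights(NODE_BANDWIDTH_GBPS)
placements = [place_page(weights) for _ in range(10_000)]
print({n: placements.count(n) for n in NODE_BANDWIDTH_GBPS})  # roughly a 3:1 split
```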

    Object Placement Simulation on Heterogeneous Memory Systems Using Context-Aware Object Profiling Information

    Master's thesis, Department of Computer Science and Engineering, College of Engineering, Seoul National University, February 2018 (advisor: Heon Young Yeom). Phase-change memory (PCM) is one of the promising non-volatile memory (NVM) technologies, since it provides both high capacity and low idle power consumption. However, its relatively slow access latency is one of the major challenges in using PCM as main memory. Recent research has therefore attempted to construct heterogeneous memory systems by combining such NVM with DRAM. One of the major problems in using those systems is placing data in the appropriate type of memory. In this paper, we propose an object placement method to address the data placement problem in heterogeneous memory systems. With context-aware object profiling information, we dynamically detect the memory access patterns of objects and determine the proper memory on which to place them. We demonstrate the effectiveness of the proposed method by simulating memory access latency and energy consumption using four selected workloads from the SPEC benchmark suite.
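
    The placement decision can be illustrated with a small sketch: estimate each object's access cost on DRAM versus PCM from profiled read/write counts, then greedily fill the limited DRAM with the objects that save the most modeled latency per byte. The latency parameters and the profile below are illustrative placeholders, not the thesis's measured values or model.

```python
# Illustrative DRAM/PCM latency parameters (nanoseconds per access).
DRAM = {"read_ns": 60, "write_ns": 60}
PCM = {"read_ns": 150, "write_ns": 500}

def modeled_latency(mem, reads, writes):
    """Total modeled access latency for an object on a given memory."""
    return reads * mem["read_ns"] + writes * mem["write_ns"]

def place(objects, dram_capacity_bytes):
    """objects: {name: (size_bytes, reads, writes)}.
    Greedily fill DRAM with the objects that save the most modeled
    latency per byte; everything else goes to PCM."""
    def benefit_per_byte(item):
        _, (size, r, w) = item
        return (modeled_latency(PCM, r, w) - modeled_latency(DRAM, r, w)) / size
    decision, used = {}, 0
    for name, (size, r, w) in sorted(objects.items(),
                                     key=benefit_per_byte, reverse=True):
        if used + size <= dram_capacity_bytes:
            decision[name], used = "DRAM", used + size
        else:
            decision[name] = "PCM"
    return decision

profile = {
    "hot_array":  (64 << 20, 1_000_000, 400_000),
    "warm_index": (32 << 20, 200_000, 10_000),
    "cold_table": (256 << 20, 100, 5),
}
print(place(profile, dram_capacity_bytes=128 << 20))
```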

    Concurrence of form and function in developing networks and its role in synaptic pruning

    A fundamental question in neuroscience is how the structure and function of neural systems are related. We study this interplay by combining a familiar auto-associative neural network with an evolving mechanism for the birth and death of synapses. A feedback loop then arises, leading to two qualitatively different types of behaviour. In one, the network structure becomes heterogeneous and disassortative, and the system displays good memory performance; furthermore, the structure is optimised for the particular memory patterns stored during the process. In the other, the structure remains homogeneous and incapable of pattern retrieval. These findings provide an inspiring picture of brain structure and dynamics that is compatible with experimental results on early brain development, and may help to explain synaptic pruning. Other evolving networks, such as those of protein interactions, might share the basic ingredients of this feedback loop, and indeed many of their structural features are as predicted by our model. We are grateful for financial support from the Spanish MINECO (project of excellence FIS2017-84256-P) and from “Obra Social La Caixa”.
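
    An illustrative sketch, not the paper's exact model: a small Hopfield-style auto-associative network whose weakest synapses are repeatedly pruned, after which a stored pattern can still be recalled from a corrupted cue. The paper's birth/death process is richer (it also creates synapses); this shows only the pruning half.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 3
patterns = rng.choice([-1, 1], size=(P, N))

# Hebbian weights with zero diagonal.
W = (patterns.T @ patterns).astype(float) / N
np.fill_diagonal(W, 0.0)

def recall(W, state, steps=20):
    """Synchronous Hopfield updates."""
    for _ in range(steps):
        state = np.where(W @ state >= 0, 1, -1)
    return state

def prune_weakest(W, fraction=0.05):
    """Death process only: zero out the weakest remaining synapses
    (smallest |w|, ties broken arbitrarily)."""
    W = W.copy()
    i, j = np.nonzero(W)
    order = np.argsort(np.abs(W[i, j]))
    k = int(fraction * len(order))
    W[i[order[:k]], j[order[:k]]] = 0.0
    return W

cue = patterns[0].copy()
cue[:10] *= -1                      # corrupt 10 of 100 bits
for _ in range(5):                  # five pruning epochs
    W = prune_weakest(W, 0.05)
overlap = recall(W, cue) @ patterns[0] / N
print(f"overlap with stored pattern after pruning: {overlap:.2f}")
```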

    Learning to Rank Graph-based Application Objects on Heterogeneous Memories

    Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can deliver higher density and lower cost per bit compared with DRAM. Its main drawback is that it is typically slower than DRAM. On the other hand, DRAM has scalability problems due to its cost and energy consumption. PMEM will likely soon coexist with DRAM in computer systems, but the biggest challenge is knowing which data to allocate on each type of memory. This paper describes a methodology for identifying and characterizing the application objects that have the most influence on the application's performance, using Intel Optane DC Persistent Memory. In the first part of our work, we built a tool that automates the profiling and analysis of application objects. In the second part, we built a machine learning model to predict the most critical objects within large-scale graph-based applications. Our results show that using isolated features does not bring the same benefit as using a carefully chosen set of features. By performing data placement using our predictive model, we can reduce execution time degradation by 12% on average and 30% at most, compared to a baseline approach based on an LLC-miss indicator.
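
    A hedged sketch of the ranking idea: train a regressor on several profiled object features rather than a single indicator such as LLC misses, then rank objects by predicted criticality so the most critical land in DRAM. The feature names and data below are synthetic, not the paper's profiler output or model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 500
# Synthetic per-object features: [llc_misses, loads, stores, size_bytes].
X = rng.lognormal(mean=8, sigma=2, size=(n, 4))
# Synthetic target: "slowdown when placed in PMEM", combining features.
y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * X[:, 2]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Rank previously unseen objects: most performance-critical first.
objects = rng.lognormal(mean=8, sigma=2, size=(10, 4))
ranking = np.argsort(-model.predict(objects))
print("place on DRAM first:", ranking[:3])
```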

    Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications

    The complexity of memory systems has increased considerably over the past decade. Supercomputers may now include several levels of heterogeneous and non-uniform memory, with significantly different properties in terms of performance, capacity, persistence, etc. Developers of scientific applications face a huge challenge: efficiently exploit the memory system to improve performance, yet keep productivity high by using portable solutions. In this work, we present a new API and a method to manage the complexity of modern memory systems. Our portable, abstracted API is designed to identify memory kinds and describe hardware characteristics using metrics, for example bandwidth, latency and capacity. It allows runtime systems, parallel libraries, and scientific applications to select the appropriate memory by expressing their needs for each allocation, without having to modify the code for each platform. Furthermore, we present a survey of existing ways to determine the sensitivity of application buffers using static code analysis, profiling and benchmarking. We show in a use case that combining these approaches with our API indeed enables a portable and productive method for matching application requirements to hardware memory characteristics.
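
    A hypothetical sketch of what such an attribute-driven selection interface could look like; this is not the paper's actual API, and the memory kinds and metric values are invented examples.

```python
from dataclasses import dataclass

@dataclass
class MemoryKind:
    name: str
    bandwidth_gbps: float
    latency_ns: float
    capacity_gib: float

# Example memory kinds with illustrative metrics.
KINDS = [
    MemoryKind("HBM", 400.0, 120.0, 16),
    MemoryKind("DDR", 100.0, 80.0, 256),
    MemoryKind("NVM", 30.0, 300.0, 1024),
]

def select_kind(prefer="bandwidth", min_capacity_gib=0.0):
    """Pick the memory kind that best matches the caller's stated need,
    so allocation sites express requirements instead of platform names."""
    candidates = [k for k in KINDS if k.capacity_gib >= min_capacity_gib]
    if prefer == "bandwidth":
        return max(candidates, key=lambda k: k.bandwidth_gbps)
    if prefer == "latency":
        return min(candidates, key=lambda k: k.latency_ns)
    return max(candidates, key=lambda k: k.capacity_gib)

print(select_kind(prefer="bandwidth").name)   # HBM
print(select_kind(prefer="capacity").name)    # NVM
```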

    Adjacent LSTM-Based Page Scheduling for Hybrid DRAM/NVM Memory Systems

    Recent advances in memory technologies have led to the rapid growth of hybrid systems that combine traditional DRAM with Non-Volatile Memory (NVM) technologies, as the latter provide lower cost per byte, lower leakage power and larger capacities than DRAM, while guaranteeing comparable access latency. Such heterogeneous memory systems impose new challenges in terms of page placement and migration among the alternative memory technologies. In this paper, we present a novel approach for efficient page placement on heterogeneous DRAM/NVM systems. We design an adjacent LSTM-based approach for page placement, which relies on predicting page accesses while sharing knowledge among pages with behavioral similarity. The proposed approach improves performance by up to 65.5% compared to existing approaches, achieving near-optimal results and saving 20.2% of energy consumption on average. Moreover, we propose a new page replacement policy, namely clustered-LRU, which improves performance by up to 8.1% compared to the default Least Recently Used (LRU) policy.
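
    An illustrative sketch, not the paper's model: a small LSTM that predicts a page's next-interval access count from its recent access history; pages predicted hot would be scheduled onto DRAM and the rest onto NVM. The training data below is synthetic.

```python
import torch
import torch.nn as nn

class PageLSTM(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):             # x: (batch, window, 1) access counts
        out, _ = self.lstm(x)
        return self.head(out[:, -1])  # predict next-interval access count

torch.manual_seed(0)
model = PageLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

# Synthetic histories: periodic "hot" pages vs. flat "cold" pages.
t = torch.arange(8).float()
hot = (5 + 5 * torch.sin(t)).repeat(64, 1).unsqueeze(-1)
cold = torch.ones(64, 8, 1)
x = torch.cat([hot, cold])
y = torch.cat([(5 + 5 * torch.sin(t[-1] + 1)).repeat(64),
               torch.ones(64)]).unsqueeze(-1)

for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()

print("predicted next accesses for a hot page:", model(x[:1]).item())
```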

    Understanding and Optimizing Serverless Workloads in CXL-Enabled Tiered Memory

    Recent serverless workloads tend to be large-scale and CPU/memory-intensive, such as deep learning and graph applications, and require dynamic provisioning of memory and compute resources. Meanwhile, recent solutions seek to design page management strategies for multi-tiered memory systems to efficiently run heavy workloads. Compute Express Link (CXL) is an ideal platform for running serverless workloads, since its cache coherence and large memory capacity offer a holistic memory namespace. However, naively offloading serverless applications to CXL memory brings substantial latency. In this work, we first quantify the impact of CXL on various serverless applications. Second, we argue for provisioning DRAM and CXL memory to serverless workloads in a fine-grained, application-specific manner, by creating a shim layer that identifies hot regions and places them in DRAM while leaving cold and warm regions in CXL memory. Based on these observations, we finally propose a prototype of Porter, a middleware between a modern serverless architecture and a CXL-enabled tiered memory system, to efficiently utilize memory resources while saving costs.
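
    A minimal sketch of the hot/cold split described above, assuming hypothetical per-region access counters: rank memory regions by recent accesses, keep the hottest in DRAM up to its capacity, and back the rest with CXL-attached memory. This illustrates the general tiering decision, not Porter's implementation.

```python
def split_regions(access_counts, region_size_mib, dram_mib):
    """access_counts: {region_id: recent_accesses} -> (dram_set, cxl_set).
    Fill the DRAM budget with the hottest regions; demote the rest to CXL."""
    budget = dram_mib // region_size_mib
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    return set(ranked[:budget]), set(ranked[budget:])

# Synthetic access counters for five 64 MiB regions, 128 MiB of DRAM.
counts = {"r0": 9000, "r1": 20, "r2": 4500, "r3": 7, "r4": 3000}
dram, cxl = split_regions(counts, region_size_mib=64, dram_mib=128)
print("DRAM:", sorted(dram), "CXL:", sorted(cxl))
```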

    Remote-scope Promotion: Clarified, Rectified, and Verified

    Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue that had previously been used in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present, and prove sound, a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we found, from making it into silicon.
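
    To make "litmus test" concrete, here is the classic message-passing pattern expressed with plain Python threads. This shows only the shape of such a test, not OpenCL or RSP semantics, and CPython's interpreter lock typically hides the weak-memory outcome the test looks for.

```python
import threading

def run_once():
    data = flag = 0
    seen = {}
    def writer():
        nonlocal data, flag
        data = 1      # store the payload
        flag = 1      # then publish it
    def reader():
        nonlocal seen
        f = flag      # read the publish flag
        d = data      # then read the payload
        seen = {"flag": f, "data": d}
    t1 = threading.Thread(target=writer)
    t2 = threading.Thread(target=reader)
    t1.start(); t2.start(); t1.join(); t2.join()
    return seen

# Under sequential consistency, flag == 1 with data == 0 is forbidden;
# a weak model (or a missing scope promotion) could allow it.
results = {tuple(sorted(run_once().items())) for _ in range(1000)}
print(results)
```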

    Efficient Machine Learning on Heterogeneous Computing Systems through a Coordinated Runtime System

    As machine learning grows, heterogeneous computing systems are actively used to increase its efficiency. Although there are prior studies on improving the efficiency of machine learning, runtime support for heterogeneous computing systems remains an unexplored field. This paper presents CEML, a runtime system that enhances the efficiency of machine learning on heterogeneous computing systems. CEML characterizes the machine-learning application in terms of performance and power consumption at runtime and builds accurate models that estimate the application's performance and power consumption. It then dynamically adapts the heterogeneous computing system to the system state estimated to be most efficient, while satisfying constraints. We demonstrate the effectiveness of CEML by evaluating estimator accuracy, energy efficiency, re-adaptation functionality, and runtime overhead on two full heterogeneous computing systems.
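
    A hedged sketch of a CEML-style adaptation loop: estimate the performance and power of candidate system states from models, then pick the most efficient state that satisfies a power constraint. The states and estimators below are synthetic placeholders, not CEML's actual models.

```python
# Candidate system states: (big cores, little cores, GPU frequency in MHz).
STATES = [(4, 0, 600), (2, 2, 600), (4, 4, 800), (2, 4, 400)]

def est_throughput(s):
    """Placeholder performance estimator (samples/s)."""
    big, little, gpu = s
    return 10 * big + 4 * little + 0.02 * gpu

def est_power(s):
    """Placeholder power estimator (watts)."""
    big, little, gpu = s
    return 3.0 * big + 1.0 * little + 0.005 * gpu

def adapt(power_budget_w):
    """Pick the feasible state with the best estimated throughput per watt."""
    feasible = [s for s in STATES if est_power(s) <= power_budget_w]
    return max(feasible, key=lambda s: est_throughput(s) / est_power(s))

print("chosen state under a 15 W budget:", adapt(15.0))
```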

    Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures

    Executing machine learning inference tasks on resource-constrained edge devices requires careful hardware-software co-design optimizations. Recent examples have shown how transformer-based deep neural network models such as ALBERT can be used to enable the execution of natural language processing (NLP) inference on mobile systems-on-chip housing custom hardware accelerators. However, while these existing solutions are effective in alleviating the latency, energy, and area costs of running single NLP tasks, achieving multi-task inference requires running computations over multiple variants of the model parameters, each tailored to one of the targeted tasks. This approach leads to either prohibitive on-chip memory requirements or paying the cost of off-chip memory access. This paper proposes adapter-ALBERT, an efficient model optimization for maximal data reuse across different tasks. The proposed model's performance and robustness to data compression methods are evaluated across several language tasks from the GLUE benchmark. Additionally, we demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator, extrapolating performance, power, and area improvements over the execution of a traditional ALBERT model on the same hardware platform.
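
    A sketch of the general adapter technique that adapter-ALBERT builds on: a small bottleneck module per task is trained while the large shared backbone stays frozen, so only the tiny per-task weights differ across tasks and could live in fast on-chip memory. The dimensions and task names below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, hidden=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)
        self.up = nn.Linear(bottleneck, hidden)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Shared backbone layer; its weights are frozen and reused by every task.
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False

# One tiny adapter per task (hypothetical task names).
adapters = {task: Adapter() for task in ("sst2", "mnli")}

x = torch.randn(1, 16, 768)       # (batch, sequence, hidden)
h = backbone(x)
out = adapters["sst2"](h)         # per-task, memory-cheap specialization
print(out.shape, sum(p.numel() for p in adapters["sst2"].parameters()))
```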