65 research outputs found
Design and Implementation of Bandwidth-aware Memory Placement and Migration Policies for Heterogeneous Memory Systems
Department of Computer Science and Engineering
Heterogeneous memory systems are composed of several types of memory and are used in various computing domains. Each memory node in a heterogeneous memory system has different characteristics and performance; particularly significant differences are found in access latency and memory bandwidth. Therefore, the heterogeneity between memories must be considered to fully exploit the performance of a heterogeneous memory system. However, most previous work did not consider the bandwidth differences among the memory nodes constituting a heterogeneous memory system.
The present work proposes bandwidth-aware memory placement and migration policies to solve the problems caused by the bandwidth differences among the memory nodes in a heterogeneous memory system. We implement three bandwidth-aware memory placement policies and one bandwidth-aware migration policy in the Linux kernel, then quantitatively evaluate them on real systems. In addition, we show that our proposed bandwidth-aware memory placement and migration policies achieve higher performance than conventional memory placement and migration policies that do not consider the bandwidth differences between heterogeneous memory nodes.
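The placement idea above can be sketched as a weighted interleaving rule: pages are distributed across memory nodes in proportion to each node's bandwidth rather than round-robin. This is an illustrative reconstruction, not the kernel code from the work; the node IDs and bandwidth figures are made up.

```python
from collections import Counter

def weighted_interleave(num_pages, node_bandwidth_gbs):
    """Assign pages to nodes in proportion to each node's bandwidth."""
    slowest = min(node_bandwidth_gbs.values())
    schedule = []  # repeating allocation pattern, e.g. {0: 90, 1: 30} -> [0, 0, 0, 1]
    for node, bw in sorted(node_bandwidth_gbs.items()):
        schedule.extend([node] * round(bw / slowest))
    return {page: schedule[page % len(schedule)] for page in range(num_pages)}

# node 0: fast DRAM at ~90 GB/s; node 1: slower memory at ~30 GB/s (made-up numbers)
placement = weighted_interleave(8, {0: 90, 1: 30})
counts = Counter(placement.values())
print(counts[0], counts[1])  # 6 2: the fast node absorbs three quarters of the pages
```

Spreading pages this way keeps both nodes' bandwidth in use instead of saturating the fast node while the slow one idles.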
Simulating Object Placement on Heterogeneous Memory Systems Using Context-Aware Object Profiling Information
Master's thesis (M.S.), Seoul National University Graduate School: Department of Computer Science and Engineering, College of Engineering, February 2018. Advisor: Heon Y. Yeom.
Phase change memory (PCM) is one of the most promising non-volatile memory (NVM) technologies, since it provides both high capacity and low idle power consumption. However, its relatively slow access latency is one of the major challenges in using PCM as main memory. Therefore, recent research has attempted to construct heterogeneous memory systems by combining such NVM with DRAM. One of the major problems in using those systems is placing data in the appropriate type of memory. In this paper, we propose an object placement method to address the data placement problem in heterogeneous memory systems. With context-aware object profile information, we can dynamically detect the memory access patterns of objects and determine the proper memory to place the objects on. We demonstrate the effectiveness of the proposed method by simulating memory access latency and energy consumption using four selected workloads from the SPEC benchmark suite.
Chapter 1 Introduction
Chapter 2 Background and Motivation
2.1 Heterogeneous Memory Systems
2.2 Context-Aware Memory Profiling
2.3 Object Profiling and Placement
Chapter 3 Object Placement Modeling
3.1 Basic Assumptions
3.2 Latency Modeling
3.3 Energy Consumption Modeling
3.4 Idle Power Consumption Modeling
3.5 Object Placement Decision
Chapter 4 Simulation
4.1 Simulation Methodology
4.2 Program Profiling Results
4.3 Simulation of Latency
4.4 Simulation of Energy Consumption
4.5 Simulation of Idle Power Consumption
Chapter 5 Conclusion
Bibliography
Abstract (in Korean)
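The placement decision the thesis models can be sketched as a cost comparison: estimate an object's total access latency on DRAM and on PCM from its profiled read/write counts, and spend scarce DRAM only where the saving is large. The latency numbers and threshold below are hypothetical, not the thesis's measured parameters.

```python
# illustrative per-access latencies in ns (not measured values):
# PCM reads are moderately slower than DRAM, PCM writes much slower
LAT = {
    "DRAM": {"read": 60, "write": 60},
    "PCM":  {"read": 150, "write": 500},
}

def place_object(reads, writes, dram_benefit_threshold_ns=200_000):
    """Place an object in DRAM only when the modeled latency saving is large."""
    dram = reads * LAT["DRAM"]["read"] + writes * LAT["DRAM"]["write"]
    pcm = reads * LAT["PCM"]["read"] + writes * LAT["PCM"]["write"]
    # DRAM capacity is scarce, so spend it where the saving exceeds a threshold.
    return "DRAM" if pcm - dram > dram_benefit_threshold_ns else "PCM"

print(place_object(reads=1000, writes=900))  # DRAM: write-heavy objects suffer on PCM
print(place_object(reads=1000, writes=0))    # PCM: a read-only object tolerates PCM
```

The asymmetry of PCM write latency is what makes write intensity, not raw access count, the deciding feature.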
Concurrence of form and function in developing networks and its role in synaptic pruning
A fundamental question in neuroscience is how structure and function of neural systems are
related. We study this interplay by combining a familiar auto-associative neural network with
an evolving mechanism for the birth and death of synapses. A feedback loop then arises
leading to two qualitatively different types of behaviour. In one, the network structure
becomes heterogeneous and disassortative, and the system displays good memory performance;
furthermore, the structure is optimised for the particular memory patterns stored
during the process. In the other, the structure remains homogeneous and incapable of pattern
retrieval. These findings provide an inspiring picture of brain structure and dynamics that
is compatible with experimental results on early brain development, and may help to explain
synaptic pruning. Other evolving networks, such as those of protein interactions, might share the basic ingredients for this feedback loop, and indeed many of their structural features are as predicted by our model.
We are grateful for financial support from the Spanish MINECO (project of Excellence FIS2017-84256-P) and from "Obra Social La Caixa".
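As a toy illustration of the interplay between stored memories and surviving structure (this is not the authors' model, only a Hopfield-style caricature): patterns are stored by Hebbian learning, the weakest synapses are then removed, and retrieval from a corrupted cue still succeeds because the pruned synapses are exactly those that carried no pattern correlation.

```python
def hebbian_weights(patterns, n):
    """Hebbian learning: w[i][j] accumulates pattern correlations."""
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w

def prune(w, keep_fraction):
    """Crude death rule: remove the synapses with the smallest |weight|."""
    n = len(w)
    flat = sorted(abs(w[i][j]) for i in range(n) for j in range(n) if i != j)
    cutoff = flat[int(len(flat) * (1 - keep_fraction))]
    return [[wij if abs(wij) >= cutoff else 0.0 for wij in row] for row in w]

def recall(w, state, steps=5):
    """Synchronous sign-threshold updates of the whole network."""
    n = len(state)
    for _ in range(steps):
        state = [1 if sum(w[i][j] * state[j] for j in range(n)) >= 0 else -1
                 for i in range(n)]
    return state

n = 20
p1 = [1, -1] * 10          # two deterministic, mutually orthogonal patterns
p2 = [1, 1, -1, -1] * 5
w = prune(hebbian_weights([p1, p2], n), keep_fraction=0.4)
noisy = p1[:]
noisy[0] = -noisy[0]       # corrupt the cue by flipping one bit
print(recall(w, noisy) == p1)  # True: retrieval survives the pruning
```

In the paper's richer setting the pruning rule co-evolves with the dynamics; this sketch only shows the static endpoint, a structure matched to the stored patterns.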
Learning to Rank Graph-based Application Objects on Heterogeneous Memories
Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can
deliver higher density and lower cost per bit when compared with DRAM. Its main
drawback is that it is typically slower than DRAM. On the other hand, DRAM has
scalability problems due to its cost and energy consumption. Soon, PMEM will
likely coexist with DRAM in computer systems, but the biggest challenge is
knowing which data to allocate to each type of memory. This paper describes a
methodology for identifying and characterizing the application objects that
have the most influence on application performance using Intel Optane DC
Persistent Memory. In the first part of our work, we built a tool that
automates the profiling and analysis of application objects. In the second
part, we built a machine learning model to predict the most critical objects
within large-scale graph-based applications. Our results show that using
isolated features does not bring the same benefit as using a carefully chosen
set of features. By performing data placement using our predictive model, we
can reduce execution-time degradation by 12% on average and 30% at most
compared to a baseline approach based on the LLC-misses indicator.
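The core ranking idea can be sketched as scoring each profiled object by a weighted combination of several features instead of a single metric such as LLC misses. The feature names, weights, and profile values below are invented for illustration; the paper learns the ranking with a machine learning model rather than fixed weights.

```python
# illustrative feature weights; the paper learns these rather than fixing them
WEIGHTS = {"llc_misses": 0.5, "access_ratio": 0.3, "bandwidth_share": 0.2}

def rank_objects(objects):
    """Return object names ordered from most to least performance-critical."""
    def score(feats):
        return sum(WEIGHTS[k] * feats[k] for k in WEIGHTS)
    return sorted(objects, key=lambda name: score(objects[name]), reverse=True)

objects = {  # normalized [0, 1] features per object (made-up profile data)
    "graph_edges": {"llc_misses": 0.9, "access_ratio": 0.8, "bandwidth_share": 0.7},
    "visited_bitmap": {"llc_misses": 0.4, "access_ratio": 0.9, "bandwidth_share": 0.2},
    "scratch_buffer": {"llc_misses": 0.1, "access_ratio": 0.2, "bandwidth_share": 0.1},
}
ranking = rank_objects(objects)
print(ranking[0])  # graph_edges ranks most critical -> keep it in DRAM
```

Placement then follows the ranking: the top objects fill DRAM, the rest spill to PMEM.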
Using Performance Attributes for Managing Heterogeneous Memory in HPC Applications
The complexity of memory systems has increased considerably over the past decade. Supercomputers may now include several levels of heterogeneous and non-uniform memory, with significantly different properties in terms of performance, capacity, persistence, etc. Developers of scientific applications face a huge challenge: efficiently exploiting the memory system to improve performance while keeping productivity high by using portable solutions. In this work, we present a new API and a method to manage the complexity of modern memory systems. Our portable and abstracted API is designed to identify memory kinds and describe hardware characteristics using metrics such as bandwidth, latency, and capacity. It allows runtime systems, parallel libraries, and scientific applications to select the appropriate memory by expressing their needs for each allocation, without having to modify the code for each platform. Furthermore, we present a survey of existing ways to determine the sensitivity of application buffers using static code analysis, profiling, and benchmarking. We show in a use case that combining these approaches with our API indeed enables a portable and productive method to match application requirements and hardware memory characteristics.
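The flavor of such an attribute-driven allocation API can be sketched as follows. The memory kinds, metric names, and numbers are invented for illustration and are not the actual interface of the API described above.

```python
MEMORY_KINDS = {  # metrics a platform might advertise (illustrative numbers)
    "DDR":    {"bandwidth_gbs": 90,  "latency_ns": 80,  "capacity_gb": 256},
    "HBM":    {"bandwidth_gbs": 400, "latency_ns": 110, "capacity_gb": 16},
    "NVDIMM": {"bandwidth_gbs": 30,  "latency_ns": 300, "capacity_gb": 1024},
}

def select_kind(maximize=None, minimize=None, min_capacity_gb=0):
    """Pick the memory kind best matching the caller's stated needs."""
    candidates = {k: v for k, v in MEMORY_KINDS.items()
                  if v["capacity_gb"] >= min_capacity_gb}
    if maximize:
        return max(candidates, key=lambda k: candidates[k][maximize])
    return min(candidates, key=lambda k: candidates[k][minimize])

print(select_kind(maximize="bandwidth_gbs"))                      # HBM
print(select_kind(minimize="latency_ns"))                         # DDR
print(select_kind(maximize="bandwidth_gbs", min_capacity_gb=64))  # DDR
```

Because the caller names a metric rather than a platform-specific device, the same allocation request remains valid on machines with different memory hierarchies.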
Adjacent LSTM-Based Page Scheduling for Hybrid DRAM/NVM Memory Systems
Recent advances in memory technologies have led to the rapid growth of hybrid systems that combine traditional DRAM and Non-Volatile Memory (NVM) technologies, as the latter provides lower cost per byte, low leakage power, and larger capacities than DRAM, while guaranteeing comparable access latency. Such heterogeneous memory systems impose new challenges in terms of page placement and migration among the alternative technologies of the memory system. In this paper, we present a novel approach for efficient page placement on heterogeneous DRAM/NVM systems. We design an adjacent LSTM-based approach for page placement that relies strongly on predicting page accesses while sharing knowledge among pages with similar behavior. The proposed approach improves performance by up to 65.5% compared to existing approaches, while achieving near-optimal results and saving 20.2% energy consumption on average. Moreover, we propose a new page replacement policy, namely clustered-LRU, achieving up to 8.1% better performance than the default Least Recently Used (LRU) policy.
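The scheduling loop this describes can be sketched in simplified form: predict each page's next-interval access count and keep the hottest pages in DRAM. The paper drives the prediction with an LSTM; the trivial moving-average predictor, page names, and counts below are stand-ins.

```python
def predict_accesses(history):
    """Stand-in predictor: mean of the last three intervals per page
    (the paper uses an adjacent LSTM here instead)."""
    return {page: sum(h[-3:]) / len(h[-3:]) for page, h in history.items()}

def schedule(history, dram_slots):
    """Keep the pages with the highest predicted access counts in DRAM."""
    predicted = predict_accesses(history)
    hottest = sorted(predicted, key=predicted.get, reverse=True)[:dram_slots]
    return {page: ("DRAM" if page in hottest else "NVM") for page in history}

history = {  # access counts per page over past intervals (synthetic)
    "p0": [90, 80, 85], "p1": [5, 4, 6], "p2": [40, 60, 70], "p3": [2, 1, 0],
}
print(schedule(history, dram_slots=2))  # p0 and p2 land in DRAM
```

The quality of the whole scheme hinges on the predictor: a learned model that generalizes across behaviorally similar pages can anticipate reuse that a simple history average misses.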
Understanding and Optimizing Serverless Workloads in CXL-Enabled Tiered Memory
Recent serverless workloads tend to be large-scale and CPU/memory-intensive,
such as deep learning and graph applications, and require dynamic provisioning
of memory and compute resources.
Meanwhile, recent solutions seek to design page management strategies for
multi-tiered memory systems to run heavy workloads efficiently. Compute
Express Link (CXL) is an ideal platform for serverless workload runtimes: it
offers a holistic memory namespace thanks to its cache-coherence feature and
large memory capacity. However, naively offloading serverless applications to
CXL incurs substantial latencies.
In this work, we first quantify the impact of CXL on various serverless
applications. Second, we argue for the opportunity of provisioning DRAM and
CXL memory to serverless workloads in a fine-grained, application-specific
manner, by creating a shim layer that identifies hot regions and places them
in DRAM while leaving cold and warm regions in CXL memory. Based on these
observations, we finally propose a prototype of Porter, a middleware between
modern serverless architectures and CXL-enabled tiered memory that efficiently
utilizes memory resources while saving costs.
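The shim-layer idea reduces to a temperature-based split: regions accessed frequently enough go to local DRAM, the rest to the CXL tier. The region names, heat values, and threshold below are illustrative, not Porter's actual classification.

```python
def place_regions(region_heat, hot_threshold=0.7):
    """Map each memory region to DRAM when hot, otherwise to the CXL tier."""
    return {r: ("DRAM" if heat >= hot_threshold else "CXL")
            for r, heat in region_heat.items()}

region_heat = {  # normalized access frequency per region (made-up profile)
    "stack": 0.95, "model_weights": 0.8, "input_blob": 0.3, "log_buffer": 0.05,
}
print(place_regions(region_heat))  # hot regions -> DRAM, cold/warm -> CXL
```

Keeping only hot regions in DRAM bounds the latency penalty of CXL while letting the large CXL capacity absorb the bulk of the footprint.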
Remote-scope Promotion: Clarified, Rectified, and Verified
Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory-model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present, and prove sound, a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon.
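To give a feel for what a memory-model simulator like Herd checks, here is a toy exhaustive enumeration of the classic message-passing litmus test under sequential consistency, confirming the "stale read" outcome cannot occur. Real RSP reasoning involves scoped, weaker-than-SC behaviours; this sketch only shows the style of outcome enumeration, with an invented program encoding.

```python
from itertools import permutations

# Thread T0: x = 1; flag = 1        Thread T1: r_flag = flag; r_x = x
PROGRAM = [("T0", "x"), ("T0", "flag"), ("T1", "flag"), ("T1", "x")]

def outcomes():
    """Enumerate all SC interleavings and collect T1's (flag, x) reads."""
    seen = set()
    for order in permutations(range(4)):
        # keep only interleavings preserving each thread's program order
        if order.index(0) < order.index(1) and order.index(2) < order.index(3):
            mem, regs = {"x": 0, "flag": 0}, {}
            for step in order:
                thread, var = PROGRAM[step]
                if thread == "T0":
                    mem[var] = 1          # T0 only writes
                else:
                    regs[var] = mem[var]  # T1 only reads
            seen.add((regs["flag"], regs["x"]))
    return seen

print(outcomes())  # (1, 0) -- flag observed set but x stale -- never appears
```

On a weak model without the right synchronisation (or, for RSP, without correctly promoted scopes), the forbidden (1, 0) outcome becomes observable, which is exactly the class of bug such simulators hunt for.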
Efficient Machine Learning on Heterogeneous Computing Systems through a Coordinated Runtime System
Department of Computer Science and Engineering
As machine learning grows, heterogeneous computing systems are actively used as a solution to increase the efficiency of machine learning. Although there are prior studies on improving the efficiency of machine learning, runtime support for heterogeneous computing systems remains an unexplored field. Our paper presents CEML, a runtime system that enhances the efficiency of machine learning on heterogeneous computing systems. CEML characterizes the machine-learning application in terms of its performance and power consumption at runtime, and builds accurate models that estimate the performance and power consumption of the machine-learning application. CEML then dynamically adapts the heterogeneous computing system to the system state estimated to be most efficient while satisfying the constraints. We demonstrate the effectiveness of CEML by evaluating the accuracy of its estimators, its energy efficiency, its re-adaptation functionality, and its runtime overheads on two full heterogeneous computing systems.
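The adaptation loop this describes can be sketched as: estimate performance and power for each candidate system state, then pick the most efficient state whose estimated performance meets the constraint. The state table and numbers below are invented stand-ins for CEML's learned estimation models.

```python
CANDIDATE_STATES = {  # hypothetical big/LITTLE core mixes with estimated metrics
    "4big+0little": {"perf": 100, "power": 40.0},
    "2big+4little": {"perf": 85,  "power": 28.0},
    "0big+4little": {"perf": 45,  "power": 12.0},
}

def adapt(min_perf):
    """Pick the state maximizing perf/power among those meeting min_perf."""
    feasible = {s: m for s, m in CANDIDATE_STATES.items() if m["perf"] >= min_perf}
    return max(feasible, key=lambda s: feasible[s]["perf"] / feasible[s]["power"])

print(adapt(min_perf=80))  # 2big+4little: best efficiency meeting the bound
print(adapt(min_perf=95))  # the constraint forces the all-big configuration
```

The interesting part in the real system is that the perf/power numbers are not a static table but are predicted at runtime by models fitted to the running application.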
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures
Executing machine learning inference tasks on resource-constrained edge
devices requires careful hardware-software co-design optimizations. Recent
examples have shown how transformer-based deep neural network models such as
ALBERT can be used to enable the execution of natural language processing (NLP)
inference on mobile systems-on-chip housing custom hardware accelerators.
However, while these existing solutions are effective in alleviating the
latency, energy, and area costs of running single NLP tasks, achieving
multi-task inference requires running computations over multiple variants of
the model parameters, which are tailored to each of the targeted tasks. This
approach leads to either prohibitive on-chip memory requirements or paying the
cost of off-chip memory access. This paper proposes adapter-ALBERT, an
efficient model optimization for maximal data reuse across different tasks. The
proposed model's performance and robustness to data compression methods are
evaluated across several language tasks from the GLUE benchmark. Additionally,
we demonstrate the advantage of mapping the model to a heterogeneous on-chip
memory architecture by performing simulations on a validated NLP edge
accelerator to extrapolate performance, power, and area improvements over the
execution of a traditional ALBERT model on the same hardware platform.
Comment: 10 pages, 6 figures, 3 tables
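The memory argument behind the adapter approach is simple arithmetic: storing one fine-tuned model copy per task scales with the number of tasks, while a shared backbone plus small per-task adapters grows only by the adapter size. The sizes below are illustrative, not ALBERT's actual parameter counts.

```python
def multitask_footprint_mb(num_tasks, backbone_mb, adapter_mb, share=True):
    """On-chip parameter footprint for the two deployment strategies."""
    if share:
        return backbone_mb + num_tasks * adapter_mb  # shared backbone + adapters
    return num_tasks * backbone_mb                   # one fine-tuned copy per task

full = multitask_footprint_mb(4, backbone_mb=45, adapter_mb=2, share=False)
shared = multitask_footprint_mb(4, backbone_mb=45, adapter_mb=2, share=True)
print(full, shared)  # 180 53: four tasks cost 180 MB unshared vs 53 MB with adapters
```

It is this gap that lets the shared parameters stay in fast on-chip memory while only the tiny task-specific adapters are swapped, which is what the heterogeneous memory mapping exploits.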