3,808 research outputs found
Design and Implementation of Bandwidth-aware Memory Placement and Migration Policies for Heterogeneous Memory Systems
Department of Computer Science and EngineeringHeterogeneous memory systems are composed of several types of memory, and are used in various
computing domains. Each memory node in heterogeneous memory systems has different characteristics
and performances. A particularly significant difference can be found in access latency and memory
bandwidth. Therefore, the heterogeneity between memories must be considered to utilize the performance
of a heterogeneous memory system. However, most of the previous works did not consider the
bandwidth difference of the memory nodes constituting a heterogeneous memory system.
The present work proposes bandwidth-aware memory placement and migration policies to solve the
problem caused by the bandwidth difference of the memory nodes in a heterogeneous memory system.
We implement three bandwidth-aware memory placement policies and one bandwidth-aware migration
policy on the Linux kernel, then quantitatively experiment on and evaluate them in real systems. In
addition, we prove that our proposed bandwidth-aware memory placement and migration policies can
achieve a higher performance compared to conventional memory placement and migration policies that
do not consider the bandwidth differences between heterogeneous memory nodes.ope
A Survey on Load Balancing Algorithms for VM Placement in Cloud Computing
The emergence of cloud computing based on virtualization technologies brings
huge opportunities to host virtual resource at low cost without the need of
owning any infrastructure. Virtualization technologies enable users to acquire,
configure and be charged on pay-per-use basis. However, Cloud data centers
mostly comprise heterogeneous commodity servers hosting multiple virtual
machines (VMs) with potential various specifications and fluctuating resource
usages, which may cause imbalanced resource utilization within servers that may
lead to performance degradation and service level agreements (SLAs) violations.
To achieve efficient scheduling, these challenges should be addressed and
solved by using load balancing strategies, which have been proved to be NP-hard
problem. From multiple perspectives, this work identifies the challenges and
analyzes existing algorithms for allocating VMs to PMs in infrastructure
Clouds, especially focuses on load balancing. A detailed classification
targeting load balancing algorithms for VM placement in cloud data centers is
investigated and the surveyed algorithms are classified according to the
classification. The goal of this paper is to provide a comprehensive and
comparative understanding of existing literature and aid researchers by
providing an insight for potential future enhancements.Comment: 22 Pages, 4 Figures, 4 Tables, in pres
Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
Department of Computer Science and EngineeringAs the performance and energy efficiency requirement of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management techniques. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and the cache indexing latency. We also demonstrate that ACI continues to achieve high performance in various settings.
Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity).
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of a application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocate pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and state-of-the-art technique (i.e., 13.4% and 16.7%) and performs similar to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.clos
Doctor of Philosophy
dissertationIn recent years, a number of trends have started to emerge, both in microprocessor and application characteristics. As per Moore's law, the number of cores on chip will keep doubling every 18-24 months. International Technology Roadmap for Semiconductors (ITRS) reports that wires will continue to scale poorly, exacerbating the cost of on-chip communication. Cores will have to navigate an on-chip network to access data that may be scattered across many cache banks. The number of pins on the package, and hence available off-chip bandwidth, will at best increase at sublinear rate and at worst, stagnate. A number of disruptive memory technologies, e.g., phase change memory (PCM) have begun to emerge and will be integrated into the memory hierarchy sooner than later, leading to non-uniform memory access (NUMA) hierarchies. This will make the cost of accessing main memory even higher. In previous years, most of the focus has been on deciding the memory hierarchy level where data must be placed (L1 or L2 caches, main memory, disk, etc.). However, in modern and future generations, each level is getting bigger and its design is being subjected to a number of constraints (wire delays, power budget, etc.). It is becoming very important to make an intelligent decision about where data must be placed within a level. For example, in a large non-uniform access cache (NUCA), we must figure out the optimal bank. Similarly, in a multi-dual inline memory module (DIMM) non uniform memory access (NUMA) main memory, we must figure out the DIMM that is the optimal home for every data page. Studies have indicated that heterogeneous main memory hierarchies that incorporate multiple memory technologies are on the horizon. We must develop solutions for data management that take heterogeneity into account. For these memory organizations, we must again identify the appropriate home for data. In this dissertation, we attempt to verify the following thesis statement: "Can low-complexity hardware and OS mechanisms manage data placement within each memory hierarchy level to optimize metrics such as performance and/or throughput?" In this dissertation we argue for a hardware-software codesign approach to tackle the above mentioned problems at different levels of the memory hierarchy. The proposed methods utilize techniques like page coloring and shadow addresses and are able to handle a large number of problems ranging from managing wire-delays in large, shared NUCA caches to distributing shared capacity among different cores. We then examine data-placement issues in NUMA main memory for a many-core processor with a moderate number of on-chip memory controllers. Using codesign approaches, we achieve efficient data placement by modifying the operating system's (OS) page allocation algorithm for a wide variety of main memory architectures
Resource and thermal management in 3D-stacked multi-/many-core systems
Continuous semiconductor technology scaling and the rapid increase in computational needs have stimulated the emergence of multi-/many-core processors. While up to hundreds of cores can be placed on a single chip, the performance capacity of the cores cannot be fully exploited due to high latencies of interconnects and memory, high power consumption, and low manufacturing yield in traditional (2D) chips. 3D stacking is an emerging technology that aims to overcome these limitations of 2D designs by stacking processor dies over each other and using through-silicon-vias (TSVs) for on-chip communication, and thus, provides a large amount of on-chip resources and shortens communication latency. These benefits, however, are limited by challenges in high power densities and temperatures.
3D stacking also enables integrating heterogeneous technologies into a single chip. One example of heterogeneous integration is building many-core systems with silicon-photonic network-on-chip (PNoC), which reduces on-chip communication latency significantly and provides higher bandwidth compared to electrical links. However, silicon-photonic links are vulnerable to on-chip thermal and process variations. These variations can be countered by actively tuning the temperatures of optical devices through micro-heaters, but at the cost of substantial power overhead.
This thesis claims that unearthing the energy efficiency potential of 3D-stacked systems requires intelligent and application-aware resource management. Specifically, the thesis improves energy efficiency of 3D-stacked systems via three major components of computing systems: cache, memory, and on-chip communication. We analyze characteristics of workloads in computation, memory usage, and communication, and present techniques that leverage these characteristics for energy-efficient computing.
This thesis introduces 3D cache resource pooling, a cache design that allows for flexible heterogeneity in cache configuration across a 3D-stacked system and improves cache utilization and system energy efficiency. We also demonstrate the impact of resource pooling on a real prototype 3D system with scratchpad memory.
At the main memory level, we claim that utilizing heterogeneous memory modules and memory object level management significantly helps with energy efficiency. This thesis proposes a memory management scheme at a finer granularity: memory object level, and a page allocation policy to leverage the heterogeneity of available memory modules and cater to the diverse memory requirements of workloads.
On the on-chip communication side, we introduce an approach to limit the power overhead of PNoC in (3D) many-core systems through cross-layer thermal management. Our proposed thermally-aware workload allocation policies coupled with an adaptive thermal tuning policy minimize the required thermal tuning power for PNoC, and in this way, help broader integration of PNoC. The thesis also introduces techniques in placement and floorplanning of optical devices to reduce optical loss and, thus, laser source power consumption.2018-03-09T00:00:00
Performance-oriented Cloud Provisioning: Taxonomy and Survey
Cloud computing is being viewed as the technology of today and the future.
Through this paradigm, the customers gain access to shared computing resources
located in remote data centers that are hosted by cloud providers (CP). This
technology allows for provisioning of various resources such as virtual
machines (VM), physical machines, processors, memory, network, storage and
software as per the needs of customers. Application providers (AP), who are
customers of the CP, deploy applications on the cloud infrastructure and then
these applications are used by the end-users. To meet the fluctuating
application workload demands, dynamic provisioning is essential and this
article provides a detailed literature survey of dynamic provisioning within
cloud systems with focus on application performance. The well-known types of
provisioning and the associated problems are clearly and pictorially explained
and the provisioning terminology is clarified. A very detailed and general
cloud provisioning classification is presented, which views provisioning from
different perspectives, aiding in understanding the process inside-out. Cloud
dynamic provisioning is explained by considering resources, stakeholders,
techniques, technologies, algorithms, problems, goals and more.Comment: 14 pages, 3 figures, 3 table
- …