18 research outputs found

    Paving the Path for Heterogeneous Memory Adoption in Production Systems

    Full text link
    Systems from smartphones to data-centers to supercomputers are increasingly heterogeneous, comprising various memory technologies and core types. Heterogeneous memory systems provide an opportunity to suitably match varying memory access pat- terns in applications, reducing CPU time thus increasing performance per dollar resulting in aggregate savings of millions of dollars in large-scale systems. However, with increased provisioning of main memory capacity per machine and differences in memory characteristics (for example, bandwidth, latency, cost, and density), memory management in such heterogeneous memory systems poses multi-fold challenges on system programmability and design. In this thesis, we tackle memory management of two heterogeneous memory systems: (a) CPU-GPU systems with a unified virtual address space, and (b) Cloud computing platforms that can deploy cheaper but slower memory technologies alongside DRAMs to reduce cost of memory in data-centers. First, we show that operating systems do not have sufficient information to optimally manage pages in bandwidth-asymmetric systems and thus fail to maximize bandwidth to massively-threaded GPU applications sacrificing GPU throughput. We present BW-AWARE placement/migration policies to support OS to make optimal data management decisions. Second, we present a CPU-GPU cache coherence design where CPU and GPU need not implement same cache coherence protocol but provide cache-coherent memory interface to the programmer. Our proposal is first practical approach to provide a unified, coherent CPU–GPU address space without requiring hardware cache coherence, with a potential to enable an explosion in algorithms that leverage tightly coupled CPU–GPU coordination. Finally, to reduce the cost of memory in cloud platforms where the trend has been to map datasets in memory, we make a case for a two-tiered memory system where cheaper (per bit) memories, such as Intel/Microns 3D XPoint, will be deployed alongside DRAM. We present Thermostat, an application-transparent huge-page-aware software mechanism to place pages in a dual-technology hybrid memory system while achieving both the cost advantages of two-tiered memory and performance advantages of transparent huge pages. With Thermostat’s capability to control the application slowdown on a per application basis, cloud providers can realize cost savings from upcoming cheaper memory technologies by shifting infrequently accessed cold data to slow memory, while satisfying throughput demand of the customers.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/137052/1/nehaag_1.pd

    Cache-Aware Real-Time Virtualization

    Get PDF
    Virtualization has been adopted in diverse computing environments, ranging from cloud computing to embedded systems. It enables the consolidation of multi-tenant legacy systems onto a multicore processor for Size, Weight, and Power (SWaP) benefits. In order to be adopted in timing-critical systems, virtualization must provide real-time guarantee for tasks and virtual machines (VMs). However, existing virtualization technologies cannot offer such timing guarantee. Tasks in VMs can interfere with each other through shared hardware components. CPU cache, in particular, is a major source of interference that is hard to analyze or manage. In this work, we focus on challenges of the impact of cache-related interferences on the real-time guarantee of virtualization systems. We propose the cache-aware real-time virtualization that provides both system techniques and theoretical analysis for tackling the challenges. We start with the challenge of the private cache overhead and propose the private cache-aware compositional analysis. To tackle the challenge of the shared cache interference, we start with non-virtualization systems and propose a shared cache-aware scheduler for operating systems to co-allocate both CPU and cache resources to tasks and develop the analysis. We then investigate virtualization systems and propose a dynamic cache management framework that hierarchically allocates shared cache to tasks. After that, we further investigate the resource allocation and analysis technique that considers not only cache resource but also CPU and memory bandwidth resources. Our solutions are applicable to commodity hardware and are essential steps to advance virtualization technology into timing-critical systems

    Integration of Non-volatile Memory into Storage Hierarchy

    Get PDF
    In this dissertation, we present novel approaches for integrating non-volatile memory devices into storage hierarchy of a computer system. There are several types of non- volatile memory devices, such as flash memory, Phase Change Memory (PCM), Spin- transfer torque memory (STT-RAM). These devices have many appealing features for applications; however, they also offer several challenges. This dissertation is focused on how to efficiently integrate these non-volatile memories into existing memory and disk storage systems. This work is composed of two major parts. The first part investigates a main-memory system employing Phase Change Memory instead of traditional DRAM. Compared to DRAM, PCM has higher density and no static power consumption, which are very important factors for building large capacity memory systems. However, PCM has higher write latency and power consumption compared to read operations. Moreover, PCM has limited write endurance. To efficiently integrate PCM into a memory system, we have to solve the challenges brought by its expensive write operations. We propose new replacement policies and cache organizations for the last-level CPU cache, which can effectively reduce the write traffic to the PCM main memory. We evaluated our design with multiple workloads and configurations. The results show that the proposed approaches improve the lifetime and energy consumption of PCM significantly. The second part of the dissertation considers the design of a data/disk storage using non-volatile memories, e.g. flash memory, PCM and nonvolatile DIMMs. We consider multiple design options for utilizing the nonvolatile memories in the storage hierarchy. First, we consider a system that employs nonvolatile memories such as PCM or nonvolatile DIMMs on memory bus along with flash-based SSDs. We propose a hybrid file system, NVMFS, that manages both these devices. NVMFS exploits the nonvolatile memory to improve the characteristics of the write workload at the SSD. We satisfy most small random write requests on the fast nonvolatile DIMM and only do large and optimized writes on SSD. We also group data of similar update patterns together before writing to flash-SSD; as a result, we can effectively reduce the garbage collection overhead. We implemented a prototype of NVMFS in Linux and evaluated its performance through multiple benchmarks. Secondly, we consider the problem of using flash memory as a cache for a disk drive based storage system. Since SSDs are expensive, a few SSDs are designed to serve as a cache for a large number of disk drives. SSD cache space can be used for both read and write requests. In our design, we managed multiple flash-SSD devices directly at the cache layer without the help of RAID software. To ensure data reliability and cache space efficiency, we only duplicated dirty data on flash- SSDs. We also balanced the write endurance of different flash-SSDs. As a result, no single SSD will fail much earlier than the others. Thirdly, when using PCM-like devices only as data storage, it’s possible to exploit memory management hardware resources to improve file system performance. However, in this case, PCM may share critical system resources such as the TLB, page table with DRAM which can potentially impact PCM’s performance. To solve this problem, we proposed to employ superpages to reduce the pressure on memory management resources. As a result, the file system performance is further improved

    A Safety-First Approach to Memory Models.

    Full text link
    Sequential consistency (SC) is arguably the most intuitive behavior for a shared-memory multithreaded program. It is widely accepted that language-level SC could significantly improve programmability of a multiprocessor system. However, efficiently supporting end-to-end SC remains a challenge as it requires that both compiler and hardware optimizations preserve SC semantics. Current concurrent languages support a relaxed memory model that requires programmers to explicitly annotate all memory accesses that can participate in a data-race ("unsafe" accesses). This requirement allows compiler and hardware to aggressively optimize unannotated accesses, which are assumed to be data-race-free ("safe" accesses), while still preserving SC semantics. However, unannotated data races are easy for programmers to accidentally introduce and are difficult to detect, and in such cases the safety and correctness of programs are significantly compromised. This dissertation argues instead for a safety-first approach, whereby every memory operation is treated as potentially unsafe by the compiler and hardware unless it is proven otherwise. The first solution, DRFx memory model, allows many common compiler and hardware optimizations (potentially SC-violating) on unsafe accesses and uses a runtime support to detect potential SC violations arising from reordering of unsafe accesses. On detecting a potential SC violation, execution is halted before the safety property is compromised. The second solution takes a different approach and preserves SC in both compiler and hardware. Both SC-preserving compiler and hardware are also built on the safety-first approach. All memory accesses are treated as potentially unsafe by the compiler and hardware. SC-preserving hardware relies on different static and dynamic techniques to identify safe accesses. Our results indicate that supporting SC at the language level is not expensive in terms of performance and hardware complexity. The dissertation also explores an extension of this safety-first approach for data-parallel accelerators such as Graphics Processing Units (GPUs). Significant microarchitectural differences between CPU and GPU require rethinking of efficient solutions for preserving SC in GPUs. The proposed solution based on our SC-preserving approach performs nearly on par with the baseline GPU that implements a data-race-free-0 memory model.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120794/1/ansingh_1.pd
    corecore