145 research outputs found

    Leeway: Addressing Variability in Dead-Block Prediction for Last-Level Caches

    Get PDF

    Towards Design and Analysis For High-Performance and Reliable SSDs

    Get PDF
    NAND Flash-based Solid State Disks have many attractive technical merits, such as low power consumption, light weight, shock resistance, sustainability of hotter operation regimes, and extraordinarily high performance for random read access, which makes SSDs immensely popular and be widely employed in different types of environments including portable devices, personal computers, large data centers, and distributed data systems. However, current SSDs still suffer from several critical inherent limitations, such as the inability of in-place-update, asymmetric read and write performance, slow garbage collection processes, limited endurance, and degraded write performance with the adoption of MLC and TLC techniques. To alleviate these limitations, we propose optimizations from both specific outside applications layer and SSDs\u27 internal layer. Since SSDs are good compromise between the performance and price, so SSDs are widely deployed as second layer caches sitting between DRAMs and hard disks to boost the system performance. Due to the special properties of SSDs such as the internal garbage collection processes and limited lifetime, traditional cache devices like DRAM and SRAM based optimizations might not work consistently for SSD-based cache. Therefore, for the outside applications layer, our work focus on integrating the special properties of SSDs into the optimizations of SSD caches. Moreover, our work also involves the alleviation of the increased Flash write latency and ECC complexity due to the adoption of MLC and TLC technologies by analyzing the real work workloads

    Exploring Superpage Promotion Policies for Efficient Address Translation

    Get PDF
    Address translation performance for modern applications depends heavily upon the number of translation entries cached in the hardware TLB (translation look-aside buffer). Therefore, the efficiency of address translation relies directly on the TLB hit rate. The number of TLB entries continues to fall further behind the growth of memory consumption for modern applications. Superpages, which are pages with larger sizes, can increase the efficiency of the TLB by enabling each translation entry to cover a larger memory region. Without requiring more TLB entries, using superpages can increase the TLB hit rate and benefit address translation. However, using superpages can bring overhead. The TLB uses a single dirty bit to mark a page as dirty during address translation before modifying the page, so the granularity of the dirty bit corresponds to the coverage of the translation entry. As a result, the OS (operating system) will pay extra I/O effort when it allocates or writes an underutilized superpage back to disk. Such extra overhead can easily surpass the address translation benefits of superpages. This thesis discusses the performance trade-offs of superpages by exploring the design space of superpage promotion policies in the OS. A data collection infrastructure is built based on QEMU with kernel instrumentation on FreeBSD to collaboratively collect both memory accesses and kernel events. Then, the TLB behavior of Intel Skylake x86 family processors is simulated. The simulation has been validated to be faithful and consistent with the real-world performance. Last, this thesis evaluates and compares both TLB performance benefits and I/O overheads among the superpage promotion policies to discuss the trade-offs in the design space

    Runtime-Driven Shared Last-Level Cache Management for Task-Parallel Programs

    Get PDF

    Addressing variability in reuse prediction for last-level caches

    Get PDF
    Last-Level Cache (LLC) represents the bulk of a modern CPU processor's transistor budget and is essential for application performance as LLC enables fast access to data in contrast to much slower main memory. Problematically, technology constraints make it infeasible to scale LLC capacity to meet the ever-increasing working set size of the applications. Thus, future processors will rely on effective cache management mechanisms and policies to get more performance out of the scarce LLC capacity. Applications with large working set size often exhibit streaming and/or thrashing access patterns at LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks that will not be referenced again, leading to inefficient utilization of the LLC capacity. To improve cache efficiency, the state-of-the-art cache management techniques employ prediction mechanisms that learn from the past access patterns with an aim to accurately identify as many dead blocks as possible. Once identified, dead blocks are evicted from LLC to make space for potentially high reuse cache blocks. In this thesis, we identify variability in the reuse behavior of cache blocks as the key limiting factor in maximizing cache efficiency for state-of-the-art predictive techniques. Variability in reuse prediction is inevitable due to numerous factors that are outside the control of LLC. The sources of variability include control-flow variation, speculative execution and contention from cores sharing the cache, among others. Variability in reuse prediction challenges existing techniques in reliably identifying the end of a block's useful lifetime, thus causing lower prediction accuracy, coverage, or both. To address this challenge, this thesis aims to design robust cache management mechanisms and policies for LLC in the face of variability in reuse prediction to minimize cache misses, while keeping the cost and complexity of the hardware implementation low. To that end, we propose two cache management techniques, one domain-agnostic and one domain-specialized, to improve cache efficiency by addressing variability in reuse prediction. In the first part of the thesis, we consider domain-agnostic cache management, a conventional approach to cache management, in which the LLC is managed fully in hardware, and thus the cache management is transparent to the software. In this context, we propose Leeway, a novel domain-agnostic cache management technique. Leeway introduces a new metric, Live Distance, that captures the largest interval of temporal reuse for a cache block, providing a conservative estimate of a cache block's useful lifetime. Leeway implements a robust prediction mechanism that identifies dead blocks based on their past Live Distance values. Leeway monitors the change in Live Distance values at runtime and dynamically adapts its reuse-aware policies to maximize cache efficiency in the face of variability. In the second part of the thesis, we identify applications, for which existing domain-agnostic cache management techniques struggle in exploiting the high reuse due to variability arising from certain fundamental application characteristics. Specifically, applications from the domain of graph analytics inherently exhibit high reuse when processing natural graphs. However, the reuse pattern is highly irregular and dependent on graph topology; a small fraction of vertices, hot vertices, exhibit high reuse whereas a large fraction of vertices exhibit low- or no-reuse. Moreover, the hot vertices are sparsely distributed in the memory space. Data-dependent irregular access patterns, combined with the sparse distribution of hot vertices, make it difficult for existing domain-agnostic predictive techniques in reliably identifying, and, in turn, retaining hot vertices in cache, causing severe underutilization of the LLC capacity. In this thesis, we observe that the software is aware of the application reuse characteristics, which, if passed on to the hardware efficiently, can help hardware in reliably identifying the most useful working set even amidst irregular access patterns. To that end, we propose a holistic approach of software-hardware co-design to effectively manage LLC for the domain of graph analytics. Our software component implements a novel lightweight software technique, called Degree-Based Grouping (DBG), that applies a coarse-grain graph reordering to segregate hot vertices in a contiguous memory region to improve spatial locality. Meanwhile, our hardware component implements a novel domain-specialized cache management technique, called Graph Specialized Cache Management (GRASP). GRASP augments existing cache policies to maximize reuse of hot vertices by protecting them against cache thrashing, while maintaining sufficient flexibility to capture the reuse of other vertices as needed. To reliably identify hot vertices amidst irregular access patterns, GRASP leverages the DBG-enabled contiguity of hot vertices. Our domain-specialized cache management not only outperforms the state-of-the-art domain-agnostic predictive techniques, but also eliminates the need for any storage-intensive prediction mechanisms
    corecore