1,309 research outputs found

    A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

    Full text link
    In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set capacity and associativity to use efficiently the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performances, that is, half of the cache lines are read-friendly, while the other is write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and performance/energy advantages of SLC. Our experiments show an improvement of 43% in total numbers of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache

    The ZCache: Decoupling Ways and Associativity

    Full text link
    Abstract—The ever-increasing importance of main memory latency and bandwidth is pushing CMPs towards caches with higher capacity and associativity. Associativity is typically im-proved by increasing the number of ways. This reduces conflict misses, but increases hit latency and energy, placing a stringent trade-off on cache design. We present the zcache, a cache design that allows much higher associativity than the number of physical ways (e.g. a 64-associative cache with 4 ways). The zcache draws on previous research on skew-associative caches and cuckoo hashing. Hits, the common case, require a single lookup, incurring the latency and energy costs of a cache with a very low number of ways. On a miss, additional tag lookups happen off the critical path, yielding an arbitrarily large number of replacement candidates for the incoming block. Unlike conventional designs, the zcache provides associativity by increasing the number of replacement candidates, but not the number of cache ways. To understand the implications of this approach, we develop a general analysis framework that allows to compare associativity across different cache designs (e.g. a set-associative cache and a zcache) by representing associativity as a probability distribution. We use this framework to show that for zcaches, associativity depends only on the number of replacement candidates, and is independent of other factors (such as the number of cache ways or the workload). We also show that, for the same number of replacement candidates, the associativity of a zcache is superior than that of a set-associative cache for most workloads. Finally, we perform detailed simulations of multithreaded and multiprogrammed workloads on a large-scale CMP with zcache as the last-level cache. We show that zcaches provide higher performance and better energy efficiency than conventional caches without incurring the overheads of designs with a large number of ways. I

    Fast speculative address generation and way caching for reducing L1 data cache energy

    Get PDF
    L1 data caches in high-performance processors continue to grow in set associativity. Higher associativity can significantly increase the cache energy consumption. Cache access latency can be affected as well, leading to an increase in overall energy consumption due to increased execution time. At the same time, the static energy consumption of the cache increases significantly with each new process generation. This paper proposes a new approach to reduce the overall L1 cache energy consumption using a combination of way caching and fast, speculative address generation. A 16-entry way cache storing a 3-bit way number for recently accessed L1 data cache lines is shown sufficient to significantly reduce both static and dynamic energy consumption of the L1 cache. Fast speculative address generation helps to hide the way cache access latency and is highly accurate. The L1 cache energy-delay product is reduced by 10% compared to using the way cache alone and by 37% compared to the use of multiple MRU technique.Peer ReviewedPostprint (published version

    Cycle Accurate Energy and Throughput Estimation for Data Cache

    Get PDF
    Resource optimization in energy constrained real-time adaptive embedded systems highly depends on accurate energy and throughput estimates of processor peripherals. Such applications require lightweight, accurate mathematical models to profile energy and timing requirements on the go. This paper presents enhanced mathematical models for data cache energy and throughput estimation. The energy and throughput models were found to be within 95% accuracy of per instruction energy model of a processor, and a full system simulator?s timing model respectively. Furthermore, the possible application of these models in various scenarios is discussed in this paper

    Buffer Controlled Cache for Low Power Multicore Processors

    Get PDF
    This thesis proposes a buffered dual access mode cache to reduce power consumption in multicore caches for embedded systems. This cache is called Buffer Controlled Cache (BCC cache). The proposed scheme introduces a pre-cache buffer to determine how to access the cache. The proposed cache shows better prediction rates and lower power consumption than conventional caches, such as Phased cache and Way-prediction cache. For single core implementation, Simplescalar and Cacti simulators have been used for these simulations using SPEC2000 benchmark programs. The experimental results show that the proposed cache improves the power consumption by 37%-42% over the conventional caches. Multi2Sim and McPAT simulators have been used for the multicore simulations using the Parsec benchmark programs. The experimental results show that the proposed cache improves the power consumption by as much as 54% over conventional caches

    Microarchitectural techniques to reduce energy consumption in the memory hierarchy

    Get PDF
    This thesis states that dynamic profiling of the memory reference stream can improve energy and performance in the memory hierarchy. The research presented in this theses provides multiple instances of using lightweight hardware structures to profile the memory reference stream. The objective of this research is to develop microarchitectural techniques to reduce energy consumption at different levels of the memory hierarchy. Several simple and implementable techniques were developed as a part of this research. One of the techniques identifies and eliminates redundant refresh operations in DRAM and reduces DRAM refresh power. Another, reduces leakage energy in L2 and higher level caches for multiprocessor systems. The emphasis of this research has been to develop several techniques of obtaining energy savings in caches using a simple hardware structure called the counting Bloom filter (CBF). CBFs have been used to predict L2 cache misses and obtain energy savings by not accessing the L2 cache on a predicted miss. A simple extension of this technique allows CBFs to do way-estimation of set associative caches to reduce energy in cache lookups. Another technique using CBFs track addresses in a Virtual Cache and reduce false synonym lookups. Finally this thesis presents a technique to reduce dynamic power consumption in level one caches using significance compression. The significant energy and performance improvements demonstrated by the techniques presented in this thesis suggest that this work will be of great value for designing memory hierarchies of future computing platforms.Ph.D.Committee Chair: Lee, Hsien-Hsin S.; Committee Member: Cahtterjee,Abhijit; Committee Member: Mukhopadhyay, Saibal; Committee Member: Pande, Santosh; Committee Member: Yalamanchili, Sudhaka
    • …
    corecore