
    Randomized cache placement for eliminating conflicts

    Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how introducing a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache whose miss ratio depends solely on working-set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite. Peer Reviewed. Postprint (published version).
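The effect described in this abstract can be illustrated with a small sketch. The code below is an assumption-laden toy, not the paper's method: it models a direct-mapped cache and swaps in a simple XOR-folding hash as a stand-in for the pseudorandom index function. The names `modulo_index`, `xor_index`, and `count_misses` are hypothetical, as are the cache size and the access trace.

```python
def modulo_index(block_addr, num_sets):
    """Conventional index: low-order bits of the block address."""
    return block_addr % num_sets


def xor_index(block_addr, num_sets):
    """Illustrative hashed index: XOR the low index bits with the next
    group of higher bits. num_sets must be a power of two.
    This is a stand-in, not the index function used in the paper."""
    bits = num_sets.bit_length() - 1
    mask = num_sets - 1
    return (block_addr ^ (block_addr >> bits)) & mask


def count_misses(block_addrs, num_sets, index_fn):
    """Count misses in a direct-mapped cache for a block-address trace."""
    cache = [None] * num_sets  # one resident block per set
    misses = 0
    for block in block_addrs:
        s = index_fn(block, num_sets)
        if cache[s] != block:
            misses += 1
            cache[s] = block
    return misses


NUM_SETS = 64
# Eight blocks spaced exactly NUM_SETS apart, swept ten times: a regular
# stride that maps every block to the same set under modulo indexing.
trace = [i * NUM_SETS for i in range(8)] * 10

modulo_misses = count_misses(trace, NUM_SETS, modulo_index)
hashed_misses = count_misses(trace, NUM_SETS, xor_index)
print(modulo_misses, hashed_misses)
```

With this particular trace, modulo indexing sends all eight blocks to one set, so every access misses, while the hashed index spreads them over distinct sets and only the first sweep misses. Other traces can behave differently; the sketch only shows why a regular stride is a worst case for a conventional index.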

    A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

    In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that dynamically adapts the set capacity and associativity to efficiently use the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines with heterogeneous performance: half of the cache lines are read-friendly, while the other half are write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which offers better overall performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC with the performance and energy advantages of SLC. Our experiments show an improvement of 43% in the total number of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache.

    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This decline in size and accompanying boost in performance demands lower supply voltages and effective power dissipation in chips with millions of transistors. This has triggered a substantial amount of research into power reduction techniques for almost every aspect of the chip, particularly the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency, mainly at the processor core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, cache, and pipelining, can be optimized to reduce the power consumption of the processor, and this paper discusses ways of doing so. Emerging power-efficient processor architectures are also reviewed, and research activities are discussed that should help the reader identify how these factors contribute to a processor's power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER

    Author retrospective for the dual data cache

    In this paper we present a retrospective on our paper published in ICS 1995, which to the best of our knowledge was the first paper to introduce the concept of a cache memory with multiple subcaches, each tuned for a different type of locality. In this retrospective, we summarize the main ideas of the original paper and outline some of the later work that exploited similar ideas and could have been influenced by it, including two actual industrial microprocessors. Peer Reviewed. Postprint (author’s final draft).

    A Simple Multi-Core Functional Cache Design Simulator

    This paper presents a flexible multi-core cache memory simulator for designing and evaluating memory hierarchies for general-purpose or embedded processors. The proposed simulator works with Pin, an open-source dynamic instrumentation tool provided by Intel. Pin intercepts the execution of instructions and generates a sequence of code (traces) to feed into the simulator for any selected benchmark programs, such as SPEC2006, SPLASH2, or PARSEC. We plan to release this simulator as open source (like Pin) to support the research and academic community in their simulation work. In addition, we expect that more functions can be added on top of this simulator and shared by the research community.
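As a rough illustration of what a functional, trace-driven cache simulator does, here is a minimal sketch. It is not the simulator described in the abstract: it models a single set-associative cache with LRU replacement rather than a multi-core hierarchy, and it reads addresses from a plain Python list instead of a Pin-generated trace. The class name `SetAssociativeCache` and all parameters are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Functional model: tracks hits and misses only, with no timing."""

    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One OrderedDict per set; insertion order doubles as LRU order
        # (oldest entry first, most recently used entry last).
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Simulate one memory access; return True on a hit."""
        block = addr // self.block_size
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[block] = True
        return False


# Hypothetical byte-address trace standing in for an instrumented run.
cache = SetAssociativeCache(num_sets=2, ways=2, block_size=64)
for addr in [0, 64, 128, 0, 256, 128]:
    cache.access(addr)
print(cache.hits, cache.misses)
```

A multi-core version would need one such object per private cache plus a shared level and a coherence model, which this sketch deliberately leaves out.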

    Yet Another Compressed Cache: a Low Cost Yet Effective Compressed Cache

    Cache memories play a critical role in bridging the latency, bandwidth, and energy gaps between cores and off-chip memory. However, caches frequently consume a significant fraction of a multicore chip’s area, and thus account for a significant fraction of its cost. Compression has the potential to improve the effective capacity of a cache, providing the performance and energy benefits of a larger cache while using less area. The design of a compressed cache must address two important issues: i) a low-latency, low-overhead compression algorithm that can represent a fixed-size cache block using fewer bits and ii) a cache organization that can efficiently store the resulting variable-size compressed blocks. This paper focuses on the latter issue. We propose YACC (Yet Another Compressed Cache), a new compressed cache design that uses super-blocks to reduce tag overheads and variable-size blocks to reduce internal fragmentation, but eliminates two major sources of complexity in previous work: decoupled tag-data mapping and address skewing. YACC’s cache layout is similar to that of conventional caches, eliminating the back-pointers used to maintain a decoupled tag-data mapping and the extra decoders used to implement skewed associativity. An additional advantage of YACC is that it enables modern replacement mechanisms, such as RRIP. For our benchmark set, YACC performs comparably to the recently proposed Skewed Compressed Cache (SCC) [Sardashti et al. 2014], but with a simpler, more area-efficient design that avoids the complexity and overheads of skewing. Compared to a conventional uncompressed 8MB LLC, YACC improves performance by 8% on average and up to 26%, and reduces total energy by 6% on average and up to 20%. An 8MB YACC achieves approximately the same performance and energy improvements as a 16MB conventional cache at a much smaller silicon footprint, with 1.6% higher area than an 8MB conventional cache.

    A Survey of Techniques for Architecting TLBs

    The translation lookaside buffer (TLB) caches virtual-to-physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.

    ArchExplorer for automatic design space exploration

    Growing architectural complexity and stringent time-to-market constraints suggest the need to move architecture design beyond parametric exploration to structural exploration. ArchExplorer is a Web-based permanent and open design-space exploration framework that lets researchers compare their designs against others. The authors demonstrate their approach by exploring the design space of an on-chip memory subsystem and a multicore processor. Postprint (published version).