    Last Bank: dealing with address reuse in non-uniform cache architecture for CMPs

    In response to the constant increase in wire delays, Non-Uniform Cache Architecture (NUCA) has been introduced as an effective memory model for dealing with growing memory latencies. This architecture divides a large cache into smaller banks that can be accessed independently, so banks close to the cache controller have a faster response time than banks located farther away from it. In this paper, we propose and analyse the insertion of an additional bank, called Last Bank, into the NUCA cache. This extra bank holds data blocks that have been evicted from the other banks. Furthermore, we analyse the behaviour of cache line replacements in the NUCA cache and propose two optimisations of Last Bank that provide significant performance benefits without incurring unaffordable implementation costs.
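
    As a rough, hypothetical sketch of the idea (not the authors' implementation), the Python below models a NUCA bank array plus a victim-style Last Bank that catches blocks evicted from the regular banks; the class names and parameters are invented for illustration.

        from collections import OrderedDict

        class Bank:
            """A single NUCA bank with LRU replacement."""
            def __init__(self, capacity):
                self.capacity = capacity
                self.lines = OrderedDict()  # address -> data, LRU first

            def access(self, addr):
                if addr in self.lines:
                    self.lines.move_to_end(addr)  # refresh recency
                    return True
                return False

            def insert(self, addr):
                """Insert a block; return the evicted address, if any."""
                victim = None
                if len(self.lines) >= self.capacity:
                    victim, _ = self.lines.popitem(last=False)  # evict LRU
                self.lines[addr] = None
                return victim

        class NUCAWithLastBank:
            """NUCA bank array plus one extra bank that catches evictions."""
            def __init__(self, n_banks, bank_capacity, last_bank_capacity):
                self.banks = [Bank(bank_capacity) for _ in range(n_banks)]
                self.last_bank = Bank(last_bank_capacity)

            def access(self, addr):
                bank = self.banks[addr % len(self.banks)]  # static interleaving
                if bank.access(addr):
                    return "hit"
                if self.last_bank.access(addr):
                    # Hit in Last Bank: promote the block back to its home bank.
                    del self.last_bank.lines[addr]
                    self._fill(bank, addr)
                    return "last-bank hit"
                self._fill(bank, addr)  # miss: block fetched from memory
                return "miss"

            def _fill(self, bank, addr):
                victim = bank.insert(addr)
                if victim is not None:
                    self.last_bank.insert(victim)  # evictions get a second chance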

    LRU-PEA: A smart replacement policy for non-uniform cache architectures on chip multiprocessors

    The increasing speed gap between processor and memory and the limited memory bandwidth make last-level cache performance crucial for CMP architectures. Non-Uniform Cache Architecture (NUCA) has been introduced to deal with this problem. This memory organization divides the whole memory space into smaller pieces or banks, allowing nearer banks to have better access latencies than farther ones. Moreover, an adaptive replacement policy that efficiently reduces misses in the last-level cache could boost performance, particularly if set associativity is assumed. Unfortunately, traditional replacement policies do not behave properly in this setting, as they were designed for single-processor systems. This paper focuses on bank replacement. This policy involves three key decisions on a miss: where to place data within the cache set, which data to evict from the cache set and, finally, where to place the evicted data. We propose a novel replacement technique that enables more intelligent replacement decisions, based on the observation that some types of data are less commonly accessed depending on the bank in which they reside. We call this technique LRU-PEA (Least Recently Used with a Priority Eviction Approach). We show that the proposed technique significantly reduces requests to the off-chip memory by increasing the hit ratio in the NUCA cache. This translates into an average IPC improvement of 8% and an Energy per Instruction (EPI) reduction of 5%.
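
    The abstract does not spell out the algorithm, but the core idea, biasing eviction towards categories of data that are rarely re-accessed, can be sketched as LRU refined by per-category eviction priorities. The categories and priority values below are made up for illustration; the actual policy conditions them on the bank in which the data resides.

        from collections import OrderedDict

        # Hypothetical categories; a higher priority means "evict first".
        EVICTION_PRIORITY = {"migrated": 2, "demand": 1, "promoted": 0}

        class LRUPEASet:
            """One cache set: LRU order refined by priority eviction."""
            def __init__(self, ways):
                self.ways = ways
                self.lines = OrderedDict()  # addr -> category, LRU first

            def access(self, addr, category="demand"):
                if addr in self.lines:
                    self.lines.move_to_end(addr)
                    return True
                if len(self.lines) >= self.ways:
                    self._evict()
                self.lines[addr] = category
                return False

            def _evict(self):
                # Among the least recently used half of the set, evict the
                # line whose category has the highest eviction priority.
                lru_half = list(self.lines.items())[: max(1, self.ways // 2)]
                victim = max(lru_half,
                             key=lambda kv: EVICTION_PRIORITY.get(kv[1], 0))[0]
                del self.lines[victim]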

    Revisiting LP-NUCA Energy Consumption: Cache Access Policies and Adaptive Block Dropping

    Cache working-set adaptation is key as embedded systems move to multiprocessor and Simultaneous Multithreading (SMT) architectures, because interthread pollution harms system performance and battery life. Light-Power NUCA (LP-NUCA) is a working-set adaptive cache that relies on temporal locality to save energy. This work identifies the sources of energy waste in LP-NUCAs: parallel access to the tag and data arrays of the tiles, and low-locality phases with useless block migration. To counteract both issues, we prove that switching to serial access reduces energy without harming performance, and we propose a machine-learning Adaptive Drop Rate (ADR) controller that minimizes replacement and migration when locality is low. This work demonstrates that these techniques efficiently adapt the cache drop and access policies to save energy. They reduce LP-NUCA consumption by 22.7% for 1SMT. With interthread cache contention in 2SMT, the savings rise to 29%. Versus a conventional organization, energy-delay improves by 20.8% and 25% for 1- and 2SMT benchmarks, and in 65% of the 2SMT mixes the gains exceed 20%.
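
    Two of these ideas lend themselves to a short sketch: serial tag-then-data access (the data array is read only after a tag hit) and a drop-rate controller that drops more evicted blocks as observed locality falls. The energy constants and the moving-average rule below are illustrative stand-ins; the paper's ADR controller is learned rather than hand-tuned.

        import random

        TAG_ENERGY, DATA_ENERGY = 1.0, 4.0  # illustrative relative energies

        def access_energy(hit, serial):
            """Energy of one cache probe under parallel or serial access."""
            if serial:
                # Serial: probe the tag array first; read data only on a hit.
                return TAG_ENERGY + (DATA_ENERGY if hit else 0.0)
            # Parallel: tag and data arrays are read together, even on a miss.
            return TAG_ENERGY + DATA_ENERGY

        class AdaptiveDropRate:
            """Drop (rather than keep/migrate) evicted blocks when locality is low."""
            def __init__(self, alpha=0.05):
                self.hit_rate = 0.5  # moving estimate of recent locality
                self.alpha = alpha

            def observe(self, hit):
                sample = 1.0 if hit else 0.0
                self.hit_rate += self.alpha * (sample - self.hit_rate)

            def should_drop(self):
                # Low locality -> high drop probability, saving migration energy.
                return random.random() < (1.0 - self.hit_rate)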

    Adaptive Resource Management Techniques for High Performance Multi-Core Architectures

    Reducing the average memory access time is crucial for improving the performance of applications executing on multi-core architectures. With workload consolidation this becomes increasingly challenging due to shared resource contention. Previous work has proposed techniques for partitioning shared resources (e.g. cache and bandwidth) and for prefetch throttling, with the goal of mitigating contention and reducing or hiding the average memory access time.

    Cache partitioning in multi-core architectures is challenging due to the need to determine cache allocations with low computational overhead and to place the partitions in a locality-aware manner. Low computational overhead is important in order to scale to large core counts. Previous work on multi-resource management has proposed coordinated management of a subset of the techniques: cache partitioning, bandwidth partitioning and prefetch throttling. However, coordinated management of all three techniques opens up new trade-offs and interactions which can be leveraged to gain better performance.

    This thesis contributes two resource management techniques: a resource manager for scalable cache partitioning, and a multi-resource management technique for coordinated management of cache partitioning, bandwidth partitioning and prefetching. The scalable cache partitioning technique uses a distributed and asynchronous partitioning algorithm that works together with a flexible NUCA enforcement mechanism to give locality-aware placement of data and to support fine-grained partitions. The algorithm adapts quickly to application phase changes. Its distributed nature, together with its low computational complexity, enables the solution to be implemented in hardware and to scale to large core counts. The multi-resource management technique is designed using the results of our in-depth characterisation of the entire SPEC CPU2006 suite. It consists of three local resource management techniques that, together with a coordination mechanism, provide allocations which take the inter-resource interactions and trade-offs into account.

    Our evaluation shows that the distributed cache partitioning solution performs within 1% of the best known centralized solution, which cannot scale to large core counts. The solution improves performance by 9% and 16%, on average, on 16- and 64-core multi-core architectures, respectively, compared to a shared last-level cache. The multi-resource management technique gives an average performance increase of 11% over the state of the art, and improves performance by 50% compared to a baseline 16-core multi-core without cache partitioning, bandwidth partitioning or prefetch throttling.
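
    A minimal sketch of the distributed partitioning idea, under assumed interfaces: each core owns a number of cache ways, and pairs of cores asynchronously trade one way whenever the gainer's marginal hit utility exceeds the loser's. The utility curves stand in for hit counters that hardware would sample; this is not the thesis's exact algorithm.

        def marginal_utility(hits_per_ways, ways):
            """Extra hits a core gains from one more way (finite difference)."""
            if ways + 1 < len(hits_per_ways):
                return hits_per_ways[ways + 1] - hits_per_ways[ways]
            return 0

        def rebalance_pair(alloc, util, a, b):
            """One asynchronous exchange step between cores a and b."""
            gain_a = marginal_utility(util[a], alloc[a])
            loss_b = (marginal_utility(util[b], alloc[b] - 1)
                      if alloc[b] > 1 else float("inf"))
            if gain_a > loss_b:  # moving a way from b to a is a net win
                alloc[a] += 1
                alloc[b] -= 1
                return True
            return False

        # Example: two cores with diminishing-returns hit curves over 0..4 ways.
        util = {0: [0, 50, 90, 120, 140], 1: [0, 10, 18, 24, 28]}
        alloc = {0: 2, 1: 2}
        while rebalance_pair(alloc, util, 0, 1):
            pass
        print(alloc)  # {0: 3, 1: 1}: the cache-hungry core 0 gains a way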

    Implementing a hybrid SRAM / eDRAM NUCA architecture

    In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is never accessed again before being replaced.
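
    One plausible way to exploit that observation (an assumption on our part, not necessarily the paper's policy) is to fill incoming blocks into the dense eDRAM ways and promote a block into the fast SRAM ways only once it proves its reuse:

        from collections import OrderedDict

        class HybridSet:
            """One set with fast SRAM ways and dense eDRAM ways."""
            def __init__(self, sram_ways, edram_ways):
                self.sram_ways, self.edram_ways = sram_ways, edram_ways
                self.sram = OrderedDict()   # blocks with proven reuse
                self.edram = OrderedDict()  # newly filled blocks

            def access(self, addr):
                if addr in self.sram:
                    self.sram.move_to_end(addr)
                    return "sram hit"
                if addr in self.edram:
                    # Reuse detected: promote the block to the fast ways.
                    del self.edram[addr]
                    self._insert(self.sram, self.sram_ways, addr, demote=True)
                    return "edram hit, promoted"
                # Miss: never-reused blocks live and die in eDRAM only.
                self._insert(self.edram, self.edram_ways, addr)
                return "miss, filled into edram"

            def _insert(self, array, ways, addr, demote=False):
                if len(array) >= ways:
                    victim, _ = array.popitem(last=False)
                    if demote:  # SRAM victims fall back to eDRAM
                        self._insert(self.edram, self.edram_ways, victim)
                array[addr] = None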

    PS-Architecture: A scalable and energy-efficient architecture for CMP NUCAs

    As the number of cores increases in both current and future shared-memory chip multiprocessor (CMP) generations, coherence protocols and all elements in the cache hierarchy must scale to sustain performance. In this work we attack the scalability problem in CMPs by studying and proposing improvements for two of those elements, namely the directory and the data caches. Each of these two structures has its particular issues, which we try to solve by employing mechanisms that exploit the different types of blocks found in parallel workloads. We introduce the PS directory, a directory cache that uses two different structures, each tailored to one of these block types (i.e., private and shared). The Shared directory cache, which tracks shared blocks, is small, fast, and has low associativity. The Private directory cache tracks private blocks, which are highly dominant in current workloads; it does not store the sharer vector, is larger than the Shared cache, and has higher associativity. We also introduce the PS cache, an energy-efficient cache design which accesses only a subset of the set's ways without hurting performance.

    Valls Mompó, JJ. (2013). PS-Architecture: A scalable and energy-efficient architecture for CMP NUCAs. http://hdl.handle.net/10251/44383
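
    A toy model of the PS directory idea, with Python dicts standing in for the two directory caches (the paper's sizing, associativity and replacement details are omitted):

        class PSDirectory:
            """Two directory structures, one per block type."""
            def __init__(self):
                self.private = {}  # addr -> owner core (no sharer vector)
                self.shared = {}   # addr -> sharer bit-vector

            def lookup(self, addr, core):
                if addr in self.shared:
                    self.shared[addr] |= 1 << core  # record another sharer
                    return "shared"
                if addr in self.private:
                    if self.private[addr] == core:
                        return "private"
                    # A second core touched the block: migrate its entry to
                    # the shared structure, which does store a sharer vector.
                    owner = self.private.pop(addr)
                    self.shared[addr] = (1 << owner) | (1 << core)
                    return "promoted to shared"
                self.private[addr] = core  # first access: track as private
                return "new private"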

    CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips

    Manycore chips are emerging as the architecture of choice to provide power scalability and improve performance while riding Moore's law. On-chip interconnects play an increasingly pivotal role in the power and performance scalability of such microarchitectures. As supply voltages begin to level off in future technologies, chip designs in general, and interconnects in particular, are resorting to specialization to provide power and performance scalability. In this paper, we make the observation that cache-coherent manycore chips exhibit a duality in on-chip network traffic: request traffic typically consists of control packets requiring narrow low-power switches, while response traffic often carries cache-block-sized payloads that require wider and higher-power switches. We present the Cache-Coherence Network-on-Chip (CCNoC), a design that capitalizes on this duality by providing a pair of asymmetric switches that optimize power and performance over conventional on-chip interconnects. Cycle-accurate simulation results for a 4x4 chip multiprocessor with a shared last-level cache running commercial server workloads indicate a 22% improvement in power over a torus and a 38% improvement over a mesh with larger channel width, while providing similar performance.
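
    The request/response duality can be illustrated with a toy routing model: control packets ride a narrow, low-power network while cache-block responses ride a wider one. The flit widths and packet sizes are made-up numbers, not CCNoC's actual parameters.

        NARROW_FLIT_BITS = 64   # request network: small coherence messages
        WIDE_FLIT_BITS = 256    # response network: cache-block payloads

        def flits(packet_bits, flit_bits):
            """Flits a packet occupies on a network (ceiling division)."""
            return -(-packet_bits // flit_bits)

        def route(packet_bits, is_response):
            """Pick a network by packet class, as in the request/response split."""
            width = WIDE_FLIT_BITS if is_response else NARROW_FLIT_BITS
            net = "response-net" if is_response else "request-net"
            return net, flits(packet_bits, width)

        print(route(64, is_response=False))  # ('request-net', 1)
        print(route(512, is_response=True))  # ('response-net', 2): a 64-byte block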

    Increasing the effectiveness of directory caches by avoiding the tracking of noncoherent memory blocks

    A key aspect in the design of efficient multiprocessor systems is the cache coherence protocol. Although directory-based protocols constitute the most scalable approach, the limited size of directory caches together with the growing size of systems may cause frequent evictions and, consequently, the invalidation of cached blocks, which jeopardizes system performance. Directory caches keep track of every memory block stored in processor caches in order to provide coherent access to the shared memory. However, a significant fraction of cached memory blocks do not require coherence maintenance (even in parallel applications), because they are either accessed by just one processor or never modified. In this paper, we propose deactivating the coherence protocol for those blocks that do not require coherence. Under this deactivation, directory caches do not have to track noncoherent blocks, which reduces directory cache occupancy and increases their effectiveness. Since the detection of noncoherent blocks is carried out by the operating system, our proposal requires only minor hardware modifications. Simulation results show that, thanks to our proposal, directory caches can avoid tracking about 66 percent (on average) of the blocks accessed by a wide range of applications, thereby improving their efficiency. This contributes either to shortening the runtime of parallel applications by 15 percent (on average) at the same directory cache size, or to maintaining performance while using directory caches 16 times smaller.

    Cuesta Sáez, BA.; Ros Bardisa, A.; Gómez Requena, ME.; Robles Martínez, A.; Duato Marín, JF. (2013). Increasing the effectiveness of directory caches by avoiding the tracking of noncoherent memory blocks. IEEE Transactions on Computers, 62(3):482-495. https://doi.org/10.1109/TC.2011.241
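
    A compact sketch of the OS-assisted deactivation idea, covering only the private-page case (read-only detection, the reclassification recovery path and the page-table mechanics are all abstracted away):

        class CoherenceFilter:
            """Directory front-end that skips tracking noncoherent blocks."""
            PAGE_SIZE = 4096

            def __init__(self):
                self.page_state = {}  # page -> ("private", core) or "shared"
                self.tracked = set()  # blocks holding a directory entry

            def access(self, addr, core):
                page = addr // self.PAGE_SIZE
                state = self.page_state.get(page)
                if state is None:
                    # First touch: the OS marks the page private (noncoherent).
                    self.page_state[page] = ("private", core)
                elif state != "shared" and state[1] != core:
                    # A second core touched the page: coherence is reactivated.
                    self.page_state[page] = "shared"
                if self.page_state[page] == "shared":
                    self.tracked.add(addr)  # only coherent blocks use entries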

    A survey of emerging architectural techniques for improving cache energy consumption

    The search goes on for another groundbreaking phenomenon to reduce the ever-increasing disparity between CPU performance and storage. There have been encouraging breakthroughs in enhancing CPU performance through fabrication technologies and changes in chip designs, but far less progress has been made on computer storage, resulting in a material negative impact on system performance. A great deal of research effort has been put into finding techniques that can improve the energy efficiency of cache architectures. This work is a survey of energy-saving techniques, grouped by whether they save dynamic energy, leakage energy, or both. The aim of this work is to compile a quick reference guide to energy-saving techniques from 2013 to 2016 for engineers, researchers and students.