110 research outputs found

    Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors

    Get PDF
    This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs

    Adaptive memory hierarchies for next generation tiled microarchitectures

    Get PDF
    Les últimes dècades el rendiment dels processadors i de les memòries ha millorat a diferent ritme, limitant el rendiment dels processadors i creant el conegut memory gap. Sol·lucionar aquesta diferència de rendiment és un camp d'investigació d'actualitat i que requereix de noves sol·lucions. Una sol·lució a aquest problema són les memòries “cache”, que permeten reduïr l'impacte d'unes latències de memòria creixents i que conformen la jerarquia de memòria. La majoria de d'organitzacions de les “caches” estan dissenyades per a uniprocessadors o multiprcessadors tradicionals. Avui en dia, però, el creixent nombre de transistors disponible per xip ha permès l'aparició de xips multiprocessador (CMPs). Aquests xips tenen diferents propietats i limitacions i per tant requereixen de jerarquies de memòria específiques per tal de gestionar eficientment els recursos disponibles. En aquesta tesi ens hem centrat en millorar el rendiment i la eficiència energètica de la jerarquia de memòria per CMPs, des de les “caches” fins als controladors de memòria. A la primera part d'aquesta tesi, s'han estudiat organitzacions tradicionals per les “caches” com les privades o compartides i s'ha pogut constatar que, tot i que funcionen bé per a algunes aplicacions, un sistema que s'ajustés dinàmicament seria més eficient. Tècniques com el Cooperative Caching (CC) combinen els avantatges de les dues tècniques però requereixen un mecanisme centralitzat de coherència que té un consum energètic molt elevat. És per això que en aquesta tesi es proposa el Distributed Cooperative Caching (DCC), un mecanisme que proporciona coherència en CMPs i aplica el concepte del cooperative caching de forma distribuïda. Mitjançant l'ús de directoris distribuïts s'obté una sol·lució més escalable i que, a més, disposa d'un mecanisme de marcatge més flexible i eficient energèticament. A la segona part, es demostra que les aplicacions fan diferents usos de la “cache” i que si es realitza una distribució de recursos eficient es poden aprofitar els que estan infrautilitzats. Es proposa l'Elastic Cooperative Caching (ElasticCC), una organització capaç de redistribuïr la memòria “cache” dinàmicament segons els requeriments de cada aplicació. Una de les contribucions més importants d'aquesta tècnica és que la reconfiguració es decideix completament a través del maquinari i que tots els mecanismes utilitzats es basen en estructures distribuïdes, permetent una millor escalabilitat. ElasticCC no només és capaç de reparticionar les “caches” segons els requeriments de cada aplicació, sinó que, a més a més, és capaç d'adaptar-se a les diferents fases d'execució de cada una d'elles. La nostra avaluació també demostra que la reconfiguració dinàmica de l'ElasticCC és tant eficient que gairebé proporciona la mateixa taxa de fallades que una configuració amb el doble de memòria.Finalment, la tesi es centra en l'estudi del comportament de les memòries DRAM i els seus controladors en els CMPs. Es demostra que, tot i que els controladors tradicionals funcionen eficientment per uniprocessadors, en CMPs els diferents patrons d'accés obliguen a repensar com estan dissenyats aquests sistemes. S'han presentat múltiples sol·lucions per CMPs però totes elles es veuen limitades per un compromís entre el rendiment global i l'equitat en l'assignació de recursos. En aquesta tesi es proposen els Thread Row Buffers (TRBs), una zona d'emmagatenament extra a les memòries DRAM que permetria guardar files de dades específiques per a cada aplicació. Aquest mecanisme permet proporcionar un accés equitatiu a la memòria sense perjudicar el seu rendiment global. En resum, en aquesta tesi es presenten noves organitzacions per la jerarquia de memòria dels CMPs centrades en la escalabilitat i adaptativitat als requeriments de les aplicacions. Els resultats presentats demostren que les tècniques proposades proporcionen un millor rendiment i eficiència energètica que les millors tècniques existents fins a l'actualitat.Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well known "memory gap". Solving this performance difference is an important research field and new solutions must be proposed in order to have better processors in the future. Several solutions exist, such as caches, that reduce the impact of longer memory accesses and conform the system memory hierarchy. However, most of the existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors has allowed the apparition of chip multiprocessors, which have different constraints and require new ad-hoc memory systems able to efficiently manage memory resources. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories. In the first part of this thesis we have studied traditional cache organizations such as shared or private caches and we have seen that they behave well only for some applications and that an adaptive system would be desirable. State-of-the-art techniques such as Cooperative Caching (CC) take advantage of the benefits of both worlds. This technique, however, requires the usage of a centralized coherence structure and has a high energy consumption. Therefore we propose the Distributed Cooperative Caching (DCC), a mechanism to provide coherence to chip multiprocessors and apply the concept of cooperative caching in a distributed way. Through the usage of distributed directories we obtain a more scalable solution and, in addition, has a more flexible and energy-efficient tag allocation method. We also show that applications make different uses of cache and that an efficient allocation can take advantage of unused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing a better scalability. ElasticCC not only is able to repartition cache sizes to application requirements, but also is able to dynamically adapt to the different execution phases of each thread. Our experimental evaluation also has shown that the cache partitioning provided by ElasticCC is efficient and is almost able to match the off-chip miss rate of a configuration that doubles the cache space. Finally, we focus in the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that new access patterns advocate for a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers, however, all of them must trade-off between memory throughput and fairness. We propose Thread Row Buffers, an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables a fair memory access scheduling without hurting memory throughput. Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability and of the proposed structures and adaptivity to application behavior. Results show that the presented techniques provide a better performance and energy-efficiency than existing state-of-the-art solutions

    Hybrid Caching for Chip Multiprocessors Using Compiler-Based Data Classification

    Get PDF
    The high performance delivered by modern computer system keeps scaling with an increasingnumber of processors connected using distributed network on-chip. As a result, memory accesslatency, largely dominated by remote data cache access and inter-processor communication, is becoming a critical performance bottleneck. To release this problem, it is necessary to localize data access as much as possible while keep efficient on-chip cache memory utilization. Achieving this however, is application dependent and needs a keen insight into the memory access characteristics of the applications. This thesis demonstrates how using fairly simple thus inexpensive compiler analysis memory accesses can be classified into private data access and shared data access. In addition, we introduce a third classification named probably private access and demonstrate the impact of this category compared to traditional private and shared memory classification. The memory access classification information from the compiler analysis is then provided to the runtime system through a modified memory allocator and page table to facilitate a hybrid private-shared caching technique. The hybrid cache mechanism is aware of different data access classification and adopts appropriate placement and search policies accordingly to improve performance. Our analysis demonstrates that many applications have a significant amount of both private and shared data and that compiler analysis can identify the private data effectively for many applications. Experimentsresults show that the implemented hybrid caching scheme achieves 4.03% performance improvement over state of the art NUCA-base caching

    Hardware-Oriented Cache Management for Large-Scale Chip Multiprocessors

    Get PDF
    One of the key requirements to obtaining high performance from chip multiprocessors (CMPs) is to effectively manage the limited on-chip cache resources shared among co-scheduled threads/processes. This thesis proposes new hardware-oriented solutions for distributed CMP caches. Computer architects are faced with growing challenges when designing cache systems for CMPs. These challenges result from non-uniform access latencies, interference misses, the bandwidth wall problem, and diverse workload characteristics. Our exploration of the CMP cache management problem suggests a CMP caching framework (CC-FR) that defines three main approaches to solve the problem: (1) data placement, (2) data retention, and (3) data relocation. We effectively implement CC-FR's components by proposing and evaluating multiple cache management mechanisms.Pressure and Distance Aware Placement (PDA) decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Flexible Set Balancing (FSB), on the other hand, reduces interference misses via extending the life time of cache lines through retaining some fraction of the working set at underutilized local sets to satisfy far-flung reuses. PDA implements CC-FR's data placement and relocation components and FSB applies CC-FR's retention approach.To alleviate non-uniform access latencies and adapt to phase changes in programs, Adaptive Controlled Migration (ACM) dynamically and periodically promotes cache blocks towards L2 banks close to requesting cores. ACM lies under CC-FR's data relocation category. Dynamic Cache Clustering (DCC), on the other hand, addresses diverse workload characteristics and growing non-uniform access latencies challenges via constructing a cache cluster for each core and expands/contracts all clusters synergistically to match each core's cache demand. DCC implements CC-FR's data placement and relocation approaches. Lastly, Dynamic Pressure and Distance Aware Placement (DPDA) combines PDA and ACM to cooperatively mitigate interference misses and non-uniform access latencies. Dynamic Cache Clustering and Balancing (DCCB), on the other hand, combines DCC and FSB to employ all CC-FR's categories and achieve higher system performance. Simulation results demonstrate the effectiveness of the proposed mechanisms and show that they compare favorably with related cache designs

    C-AMTE: A location mechanism for flexible cache management in chip multiprocessors

    Get PDF
    This paper describes Constrained Associative-Mapping-of-Tracking-Entries (C-AMTE), a scalable mechanism to facilitate flexible and efficient distributed cache management in large-scale chip multiprocessors (CMPs). C-AMTE enables fast locating of cache blocks in CMP cache schemes that employ one-to-one or one-to-many associative mappings. C-AMTE stores in per-core data structures tracking entries to avoid on-chip interconnect traffic outburst or long distance directory lookups. Simulation results using a full system simulator demonstrate that C-AMTE achieves improvement in cache access latency by up to 34.4%, close to that of a perfect location strategy. © 2010 Elsevier Inc. All rights reserved

    Jigsaw: Scalable software-defined caches

    Get PDF
    Shared last-level caches, widely used in chip-multi-processors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency.United States. Defense Advanced Research Projects Agency (DARPA PERFECT contract HR0011-13-2-0005)Quanta Computer (Firm

    An Efficient Cache Organization for On-Chip Multiprocessor Networks

    Get PDF
    To meet the growing computation-intensive applications and the needs of low-power, high-performance systems, the number of computing resources in single-chip has enormously increased. By adding many computing resources to build a system in System-on-Chip, its interconnection between each other becomes another challenging issue. In most System-on-Chip applications, a shared bus interconnection which needs an arbitration logic to serialize several bus access requests, is adopted to communicate with each integrated processing unit because of its low-cost and simple control characteristics. This paper focuses on the interconnection design issues of area, power and performance of chip multi-processors with shared cache memory. It shows that having shared cache memory contributes to the performance improvement, however, typical interconnection between cores and the shared cache using crossbar occupies most of the chip area, consumes a lot of power and does not scale efficiently with increased number of cores. New interconnection mechanisms are needed to address these issues. This paper proposes an architectural paradigm in an attempt to gain the advantages of having shared cache with the avoidance of penalty imposed by the crossbar interconnect. The proposed architecture achieves smaller area occupation allowing more space to add additional cache memory. It also reduces power consumption compared to the existing crossbar architecture. Furthermore, the paper presents a modified cache coherence algorithm called Tuned-MESI. It is based on the typical MESI cache coherence algorithm however it is tuned and tailored for the suggested architecture. The achieved results of the conducted simulated experiments show that the developed architecture produces less broadcast operations compared to the typical algorithm

    Jigsaw: Scalable Software-Defined Caches (Extended Version)

    Get PDF
    Shared last-level caches, widely used in chip-multiprocessors (CMPs), face two fundamental limitations. First, the latency and energy of shared caches degrade as the system scales up. Second, when multiple workloads share the CMP, they suffer from interference in shared cache accesses. Unfortunately, prior research addressing one issue either ignores or worsens the other: NUCA techniques reduce access latency but are prone to hotspots and interference, and cache partitioning techniques only provide isolation but do not reduce access latency. We present Jigsaw, a technique that jointly addresses the scalability and interference problems of shared caches. Hardware lets software define shares, collections of cache bank partitions that act as virtual caches, and map data to shares. Shares give software full control over both data placement and capacity allocation. Jigsaw implements efficient hardware support for share management, monitoring, and adaptation. We propose novel resource-management algorithms and use them to develop a system-level runtime that leverages Jigsaw to both maximize cache utilization and place data close to where it is used. We evaluate Jigsaw using extensive simulations of 16- and 64-core tiled CMPs. Jigsaw improves performance by up to 2.2x (18% avg) over a conventional shared cache, and significantly outperforms state-of-the-art NUCA and partitioning techniques.This work was supported in part by DARPA PERFECT contract HR0011-13-2-0005 and Quanta Computer

    Scaling Single-Program Performance on Large-Scale Chip Multiprocessors

    Get PDF
    Due to power constraints, computer architects will exploit TLP instead of ILP for future performance gains. Today, 4-8 state-of-the-art cores or 10s of smaller cores can fit on a single die. For the foreseeable future, the number of cores will likely double with each successive processor generation. Hence, CMPs with 100s of cores-so-called large-scale chip multiprocessors (LCMPs)-will become a reality after only 2 or 3 generations. Unfortunately, simply scaling the number of on-chip cores alone will not guarantee improved performance. In addition, effectively utilizing all of the cores is also necessary. Perhaps the greatest threat to processor utilization will be the overhead incurred waiting on the memory system, especially as on-chip concurrency scales to 100s of threads. In particular, remote cache bank access and off-chip bandwidth contention are likely to be the most significant obstacles to scaling memory performance. This paper conducts an in-depth study of CMP scalability for parallel programs. We assume a tiled CMP in which tiles contain a simple core along with a private L1 cache and a local slice of a shared L2 cache. Our study considers scaling from 1-256 cores and 4-128MB of total L2 cache, and addresses several issues related to the impact of scaling on off-chip bandwidth and on-chip communication. In particular, we find off-chip bandwidth increases linearly with core count, but the rate of increase reduces dramatically once enough L2 cache is provided to capture inter-thread sharing. Our results also show for the range 1-256 cores, there should be ample on-chip bandwidth to support the communication requirements of our benchmarks. Finally, we find that applications become off-chip limited when their L2 cache miss rates exceed some minimum threshold. Moreover, we expect off-chip overheads to dominate on-chip overheads for memory intensive programs and LCMPs with aggressive cores
    corecore