959 research outputs found

    Adaptive memory hierarchies for next generation tiled microarchitectures

    Get PDF
    Les últimes dècades el rendiment dels processadors i de les memòries ha millorat a diferent ritme, limitant el rendiment dels processadors i creant el conegut memory gap. Sol·lucionar aquesta diferència de rendiment és un camp d'investigació d'actualitat i que requereix de noves sol·lucions. Una sol·lució a aquest problema són les memòries “cache”, que permeten reduïr l'impacte d'unes latències de memòria creixents i que conformen la jerarquia de memòria. La majoria de d'organitzacions de les “caches” estan dissenyades per a uniprocessadors o multiprcessadors tradicionals. Avui en dia, però, el creixent nombre de transistors disponible per xip ha permès l'aparició de xips multiprocessador (CMPs). Aquests xips tenen diferents propietats i limitacions i per tant requereixen de jerarquies de memòria específiques per tal de gestionar eficientment els recursos disponibles. En aquesta tesi ens hem centrat en millorar el rendiment i la eficiència energètica de la jerarquia de memòria per CMPs, des de les “caches” fins als controladors de memòria. A la primera part d'aquesta tesi, s'han estudiat organitzacions tradicionals per les “caches” com les privades o compartides i s'ha pogut constatar que, tot i que funcionen bé per a algunes aplicacions, un sistema que s'ajustés dinàmicament seria més eficient. Tècniques com el Cooperative Caching (CC) combinen els avantatges de les dues tècniques però requereixen un mecanisme centralitzat de coherència que té un consum energètic molt elevat. És per això que en aquesta tesi es proposa el Distributed Cooperative Caching (DCC), un mecanisme que proporciona coherència en CMPs i aplica el concepte del cooperative caching de forma distribuïda. Mitjançant l'ús de directoris distribuïts s'obté una sol·lució més escalable i que, a més, disposa d'un mecanisme de marcatge més flexible i eficient energèticament. A la segona part, es demostra que les aplicacions fan diferents usos de la “cache” i que si es realitza una distribució de recursos eficient es poden aprofitar els que estan infrautilitzats. Es proposa l'Elastic Cooperative Caching (ElasticCC), una organització capaç de redistribuïr la memòria “cache” dinàmicament segons els requeriments de cada aplicació. Una de les contribucions més importants d'aquesta tècnica és que la reconfiguració es decideix completament a través del maquinari i que tots els mecanismes utilitzats es basen en estructures distribuïdes, permetent una millor escalabilitat. ElasticCC no només és capaç de reparticionar les “caches” segons els requeriments de cada aplicació, sinó que, a més a més, és capaç d'adaptar-se a les diferents fases d'execució de cada una d'elles. La nostra avaluació també demostra que la reconfiguració dinàmica de l'ElasticCC és tant eficient que gairebé proporciona la mateixa taxa de fallades que una configuració amb el doble de memòria.Finalment, la tesi es centra en l'estudi del comportament de les memòries DRAM i els seus controladors en els CMPs. Es demostra que, tot i que els controladors tradicionals funcionen eficientment per uniprocessadors, en CMPs els diferents patrons d'accés obliguen a repensar com estan dissenyats aquests sistemes. S'han presentat múltiples sol·lucions per CMPs però totes elles es veuen limitades per un compromís entre el rendiment global i l'equitat en l'assignació de recursos. En aquesta tesi es proposen els Thread Row Buffers (TRBs), una zona d'emmagatenament extra a les memòries DRAM que permetria guardar files de dades específiques per a cada aplicació. Aquest mecanisme permet proporcionar un accés equitatiu a la memòria sense perjudicar el seu rendiment global. En resum, en aquesta tesi es presenten noves organitzacions per la jerarquia de memòria dels CMPs centrades en la escalabilitat i adaptativitat als requeriments de les aplicacions. Els resultats presentats demostren que les tècniques proposades proporcionen un millor rendiment i eficiència energètica que les millors tècniques existents fins a l'actualitat.Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well known "memory gap". Solving this performance difference is an important research field and new solutions must be proposed in order to have better processors in the future. Several solutions exist, such as caches, that reduce the impact of longer memory accesses and conform the system memory hierarchy. However, most of the existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors has allowed the apparition of chip multiprocessors, which have different constraints and require new ad-hoc memory systems able to efficiently manage memory resources. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories. In the first part of this thesis we have studied traditional cache organizations such as shared or private caches and we have seen that they behave well only for some applications and that an adaptive system would be desirable. State-of-the-art techniques such as Cooperative Caching (CC) take advantage of the benefits of both worlds. This technique, however, requires the usage of a centralized coherence structure and has a high energy consumption. Therefore we propose the Distributed Cooperative Caching (DCC), a mechanism to provide coherence to chip multiprocessors and apply the concept of cooperative caching in a distributed way. Through the usage of distributed directories we obtain a more scalable solution and, in addition, has a more flexible and energy-efficient tag allocation method. We also show that applications make different uses of cache and that an efficient allocation can take advantage of unused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing a better scalability. ElasticCC not only is able to repartition cache sizes to application requirements, but also is able to dynamically adapt to the different execution phases of each thread. Our experimental evaluation also has shown that the cache partitioning provided by ElasticCC is efficient and is almost able to match the off-chip miss rate of a configuration that doubles the cache space. Finally, we focus in the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that new access patterns advocate for a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers, however, all of them must trade-off between memory throughput and fairness. We propose Thread Row Buffers, an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables a fair memory access scheduling without hurting memory throughput. Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability and of the proposed structures and adaptivity to application behavior. Results show that the presented techniques provide a better performance and energy-efficiency than existing state-of-the-art solutions

    Temperature-Aware Runtime Power Management for Chip-Multiprocessors with 3-D Stacked Cache

    Get PDF
    The advent of 3-D fabrication technology makes it possible to stack a large amount of last-level cache memory onto a multi-core die to reduce off-chip memory accesses and, thus, increases system performance. However, the higher power density (i.e., power dissipation per unit volume) of 3-D integrated circuits (ICs) might incur temperature-related problems in reliability, leakage power, system performance, and cooling cost. In this paper, we propose a runtime solution to maximize the performance (i.e., instruction throughput) of chip-multiprocessors with 3-D stacked last-level cache memory, without thermal-constraint violation. The proposed method combines runtime cache tuning (e.g., cache-way partitioning, cache-way power-gating, cache data placement) with per-core dynamic voltage/frequency scaling (DVFS) in a temperature-aware manner. Experimental results show that the integrated method offers 23% performance improvement on average in terms of instructions per second (IPS) compared with temperature-aware runtime cache tuning only

    ATMP: An Adaptive Tolerance-based Mixed-criticality Protocol for Multi-core Systems

    Get PDF
    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted ncomponent of this work in other works.The challenge of mixed-criticality scheduling is to keep tasks of higher criticality running in case of resource shortages caused by faults. Traditionally, mixedcriticality scheduling has focused on methods to handle faults where tasks overrun their optimistic worst-case execution time (WCET) estimate. In this paper we present the Adaptive Tolerance based Mixed-criticality Protocol (ATMP), which generalises the concept of mixed-criticality scheduling to handle also faults of other nature, like failure of cores in a multi-core system. ATMP is an adaptation method triggered by resource shortage at runtime. The first step of ATMP is to re-partition the task to the available cores and the second step is to optimise the utility at each core using the tolerance-based real-time computing model (TRTCM). The evaluation shows that the utility optimisation of ATMP can achieve a smoother degradation of service compared to just abandoning tasks

    A Survey of Techniques For Improving Energy Efficiency in Embedded Computing Systems

    Full text link
    Recent technological advances have greatly improved the performance and features of embedded systems. With the number of just mobile devices now reaching nearly equal to the population of earth, embedded systems have truly become ubiquitous. These trends, however, have also made the task of managing their power consumption extremely challenging. In recent years, several techniques have been proposed to address this issue. In this paper, we survey the techniques for managing power consumption of embedded systems. We discuss the need of power management and provide a classification of the techniques on several important parameters to highlight their similarities and differences. This paper is intended to help the researchers and application-developers in gaining insights into the working of power management techniques and designing even more efficient high-performance embedded systems of tomorrow

    The potential of programmable logic in the middle: cache bleaching

    Full text link
    Consolidating hard real-time systems onto modern multi-core Systems-on-Chip (SoC) is an open challenge. The extensive sharing of hardware resources at the memory hierarchy raises important unpredictability concerns. The problem is exacerbated as more computationally demanding workload is expected to be handled with real-time guarantees in next-generation Cyber-Physical Systems (CPS). A large body of works has approached the problem by proposing novel hardware re-designs, and by proposing software-only solutions to mitigate performance interference. Strong from the observation that unpredictability arises from a lack of fine-grained control over the behavior of shared hardware components, we outline a promising new resource management approach. We demonstrate that it is possible to introduce Programmable Logic In-the-Middle (PLIM) between a traditional multi-core processor and main memory. This provides the unique capability of manipulating individual memory transactions. We propose a proof-of-concept system implementation of PLIM modules on a commercial multi-core SoC. The PLIM approach is then leveraged to solve long-standing issues with cache coloring. Thanks to PLIM, colored sparse addresses can be re-compacted in main memory. This is the base principle behind the technique we call Cache Bleaching. We evaluate our design on real applications and propose hypervisor-level adaptations to showcase the potential of the PLIM approach.Accepted manuscrip

    Exploiting heterogeneity in Chip-Multiprocessor Design

    Get PDF
    In the past decade, semiconductor manufacturers are persistent in building faster and smaller transistors in order to boost the processor performance as projected by Moore’s Law. Recently, as we enter the deep submicron regime, continuing the same processor development pace becomes an increasingly difficult issue due to constraints on power, temperature, and the scalability of transistors. To overcome these challenges, researchers propose several innovations at both architecture and device levels that are able to partially solve the problems. These diversities in processor architecture and manufacturing materials provide solutions to continuing Moore’s Law by effectively exploiting the heterogeneity, however, they also introduce a set of unprecedented challenges that have been rarely addressed in prior works. In this dissertation, we present a series of in-depth studies to comprehensively investigate the design and optimization of future multi-core and many-core platforms through exploiting heteroge-neities. First, we explore a large design space of heterogeneous chip multiprocessors by exploiting the architectural- and device-level heterogeneities, aiming to identify the optimal design patterns leading to attractive energy- and cost-efficiencies in the pre-silicon stage. After this high-level study, we pay specific attention to the architectural asymmetry, aiming at developing a heterogeneity-aware task scheduler to optimize the energy-efficiency on a given single-ISA heterogeneous multi-processor. An advanced statistical tool is employed to facilitate the algorithm development. In the third study, we shift our concentration to the device-level heterogeneity and propose to effectively leverage the advantages provided by different materials to solve the increasingly important reliability issue for future processors
    corecore