255 research outputs found

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Ever-decreasing device sizes cause more frequent hard faults, which become a serious burden on processor design and yield management. The problem is particularly pronounced in on-chip memory, which consumes up to 70% of a processor's total chip area. Traditional circuit-level techniques, such as redundancy and error-correcting codes, become less effective in error-prevalent environments because of their large area overhead. In this work, we propose an architectural solution to building reliable on-chip memory in future processor environments. Our approach has two parts: a design framework and architectural techniques for on-chip memory structures. The design framework derives key architectural evaluation metrics such as yield, area, and performance from low-level defect and process-variation parameters, so processor architects can quickly evaluate their designs in these terms. Using the framework, we develop architectural yield-enhancement solutions for on-chip memory structures including the L1 cache, the L2 cache, and directory memory; the proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model for compute cores and L2 caches in CMPs, which shows that a chip will contain many more working L2 caches than compute cores, and we propose techniques for efficiently utilizing these excess caches. Evaluation results show that the excess caches significantly improve overall CMP performance.
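    To make the yield reasoning concrete, the sketch below applies a standard negative-binomial (clustered-defect) yield formula separately to cores and L2 banks; the areas, defect density, and clustering factor are purely illustrative assumptions, not parameters from this work, and the sketch only shows why smaller L2 banks tend to yield better than larger compute cores.

        # Hedged sketch: negative-binomial yield model applied separately
        # ("decoupled") to compute cores and L2 cache banks. All numbers are
        # illustrative assumptions, not measured parameters.
        def unit_yield(area_cm2, defect_density=0.5, alpha=2.0):
            """Yield of one unit: (1 + A*D/alpha)^(-alpha)."""
            return (1.0 + area_cm2 * defect_density / alpha) ** (-alpha)

        def expected_good(n_units, unit_area_cm2):
            """Expected number of fault-free units among n identical units."""
            return n_units * unit_yield(unit_area_cm2)

        cores = expected_good(n_units=8,  unit_area_cm2=0.12)  # large cores
        banks = expected_good(n_units=16, unit_area_cm2=0.04)  # smaller L2 banks
        print(f"expected good cores: {cores:.2f}, good L2 banks: {banks:.2f}")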

    L2C2: Last-level compressed-contents non-volatile cache and a procedure to forecast performance and lifetime

    Several emerging non-volatile (NV) memory technologies are rising as interesting alternatives for building the Last-Level Cache (LLC). Their advantages over SRAM are higher density and lower static power, but write operations wear out the bitcells to the point of eventually losing their storage capacity. In this context, this paper presents a novel LLC organization designed to extend the lifetime of the NV data array, together with a procedure to forecast in detail the capacity and performance of such an NV-LLC over its lifetime. From a methodological point of view, although different approaches have been used in the literature to analyze the degradation of an NV-LLC, none of them allows its temporal evolution to be studied in detail. This work therefore proposes a forecasting procedure that combines detailed simulation and prediction, allowing an accurate analysis of the impact of different cache control policies and mechanisms (replacement, wear-leveling, compression, etc.) on the temporal evolution of the indices of interest, such as the effective capacity of the NV-LLC or the system IPC. We also introduce L2C2, an LLC design intended for implementation in NV memory technology that, for the first time, combines fault tolerance, compression, and internal write wear-leveling. Compression is not used to store more blocks and increase the hit rate, but to reduce the write rate and extend the lifetime during which the cache sustains near-peak performance. In addition, to tolerate byte loss without a performance drop, L2C2 inherently allows N redundant bytes to be added to each cache entry. Thus L2C2+N, the endurance-scaled version of L2C2, allows the cost of redundant capacity to be balanced against the benefit of a longer lifetime. As a use case, we have implemented the L2C2 cache in STT-RAM technology. Its hardware overheads in area, latency, and energy consumption are affordable compared to a baseline NV-LLC without compression, and it extends by 6-37 times, depending on the manufacturing-process variability, the time at which 50% of the effective capacity has degraded. Compared to L2C2, L2C2+6, which adds 6 bytes of redundant capacity per entry (a 9.1% storage overhead), extends by up to 1.4-4.3 times the time during which the system retains its initial peak performance.
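    As an aside, the following Monte-Carlo sketch illustrates the L2C2+N idea of provisioning N redundant bytes per entry; the byte-failure probability and entry size are assumptions made up for illustration, not figures from the paper.

        # Hedged sketch: an entry with 64 payload bytes plus N spare bytes stays
        # usable while the number of worn-out bytes does not exceed N.
        import random

        def usable_fraction(data_bytes=64, n_redundant=6, p_byte_failed=0.02,
                            trials=100_000, seed=42):
            """Monte-Carlo estimate of the fraction of entries still usable."""
            rng = random.Random(seed)
            usable = 0
            for _ in range(trials):
                failed = sum(rng.random() < p_byte_failed
                             for _ in range(data_bytes + n_redundant))
                if failed <= n_redundant:      # spare bytes absorb the failures
                    usable += 1
            return usable / trials

        for n in (0, 6):
            print(f"N={n}: usable entries ~ {usable_fraction(n_redundant=n):.3f}")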

    On-Demand Solution to Minimize I-Cache Leakage Energy with Maintaining Performance


    Multi-criteria optimization for energy-efficient multi-core systems-on-chip

    The steady down-scaling of transistor dimensions has enabled the evolutionary progress leading to today's high-performance multi-GHz microprocessors and core-based Systems-on-Chip (SoCs), which offer superior performance, dramatically reduced cost per function, and much smaller physical size than their predecessors. On the negative side, this rapid scaling also translates into higher power densities, higher operating temperatures, and reduced reliability, making it imperative to address the design issues that have cropped up in its wake. In particular, aggressive physical miniaturization has increased CMOS fault sensitivity to the extent that many reliability constraints threaten normal device operation and accelerate the onset of wearout-based failures. Among the various wearout mechanisms, negative bias temperature instability (NBTI) has been recognized as the most critical source of device aging. The need for reliable, low-power circuits is driving the EDA community to develop new design techniques, circuit solutions, algorithms, and software that can address these critical issues. Unfortunately, the challenge is complicated by the fact that power and reliability are intrinsically conflicting metrics: traditional reliability-enhancing solutions such as redundancy, higher voltage levels, and up-sizing of critical devices contrast with traditional low-power solutions, which rely on compact architectures, scaled supply voltages, and small devices. This dissertation focuses on methodologies that bridge this gap and establishes an important link between low-power solutions and aging effects. More specifically, we propose new architectural solutions based on power-management strategies to enable the design of low-power, aging-aware cache memories. Cache memories are among the most critical components for guaranteeing reliable and timely operation, yet they are also highly susceptible to aging. Because of the symmetric structure of a memory cell, aging occurs regardless of whether a cell (or word) is accessed; moreover, aging is a worst-case metric, so the line with the worst-case access pattern determines the aging of the entire cache. To stop a memory cell from aging, it must be put into a proper idle state when it is not accessed, which requires careful management of the idleness of each atomic unit of power management. We propose several reliability-management techniques based on cache partitioning to alleviate NBTI-induced aging and obtain joint energy and lifetime benefits. We introduce a graceful-degradation mechanism that allows the different blocks into which a cache is partitioned to age at different rates; the various sub-blocks thus become unreliable at different times, and the cache keeps functioning with reduced efficiency. We extend this architecture with reconfigurable caches to maintain performance throughout the lifetime: whenever a block becomes unreliable, the remaining cache is reconfigured to work as a smaller cache with only marginal performance degradation. Finally, many mission-critical applications require a guaranteed lifetime for their operation, and hence for the hardware implementing their functionality; such constraints are usually enforced by reliability-enhancing solutions, mostly based on redundancy, which are not energy-friendly. In our work, we propose a novel cache architecture in which a smart use of cache partitions for redundancy yields caches that meet a desired lifetime target with minimal energy consumption.
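    A minimal sketch of the graceful-degradation idea follows; the class, wear threshold, and stress values are hypothetical and only illustrate how partitions that age at different rates can be retired one by one while the cache keeps operating with reduced associativity.

        # Hedged sketch: a cache split into partitions that accumulate
        # NBTI-like stress at different rates; worn-out partitions are retired
        # and the cache is reconfigured to a smaller effective size.
        class GracefullyDegradingCache:
            def __init__(self, n_partitions=8, ways_per_partition=2, wear_limit=1.0):
                self.wear = [0.0] * n_partitions
                self.alive = [True] * n_partitions
                self.ways_per_partition = ways_per_partition
                self.wear_limit = wear_limit

            def age(self, per_partition_stress):
                """Accumulate stress and retire partitions exceeding the limit."""
                for i, stress in enumerate(per_partition_stress):
                    if self.alive[i]:
                        self.wear[i] += stress
                        if self.wear[i] >= self.wear_limit:
                            self.alive[i] = False   # drop this partition

            def effective_ways(self):
                return sum(self.alive) * self.ways_per_partition

        cache = GracefullyDegradingCache()
        cache.age([0.3, 0.5, 1.1, 0.2, 0.4, 0.9, 0.1, 0.6])  # uneven aging
        print("remaining associativity:", cache.effective_ways())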

    Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices

    This research addresses the design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault- and defect-tolerant (FDT) computer. Trade-offs between different techniques are studied, and yield and hardware-cost models are developed. Fault- and defect-tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield at device failure probabilities of 3 x 10^-6, three orders of magnitude better than non-fault-tolerant caches of the same size. The entire processor achieves 70% yield at device failure probabilities exceeding 10^-6. The required hardware redundancy is approximately 15 times that of a non-fault-tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of the model development, an improved model is derived for NAND multiplexing; it is the first accurate model for small and medium amounts of redundancy. Previous models are extended to account for dependence between the inputs and produce more accurate results.
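    To illustrate the kind of yield-versus-redundancy arithmetic involved, here is a short sketch; the row-sparing scheme and all parameters are assumptions chosen for the example, not the architecture or models described above.

        # Hedged sketch: row-level redundancy. A row works only if all of its
        # devices work; the array is usable if the number of failed rows does
        # not exceed the number of spare rows.
        from math import comb

        def row_yield(devices_per_row, p_device_fail):
            return (1.0 - p_device_fail) ** devices_per_row

        def array_yield(rows, spares, devices_per_row, p_device_fail=3e-6):
            p_ok = row_yield(devices_per_row, p_device_fail)
            total = rows + spares
            # Binomial: at most `spares` of the rows may fail.
            return sum(comb(total, k) * (1 - p_ok) ** k * p_ok ** (total - k)
                       for k in range(spares + 1))

        print(f"no spares : {array_yield(1024, 0,  devices_per_row=512):.3f}")
        print(f"16 spares : {array_yield(1024, 16, devices_per_row=512):.3f}")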

    Automatic synthesis and optimization of chip multiprocessors

    Microprocessor technology has experienced enormous growth during the last decades. Rapid down-scaling of CMOS technology has led to higher operating frequencies and performance densities, while raising the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm for improving the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications, and industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores. CMP architects are challenged to take many complex design decisions, only a few of which are:
    - What should be the ratio between the core and cache areas on a chip?
    - Which core architectures should be selected?
    - How many cache levels should the memory subsystem have?
    - Which interconnect topologies provide efficient on-chip communication?
    These and many other aspects create a complex multidimensional space for architectural exploration. Design-automation tools become essential to make this exploration feasible under hard time-to-market constraints, and the exploration methods have to be efficient and scalable to handle future on-chip architectures with hundreds or thousands of cores. Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises: intelligent techniques for task mapping and scheduling onto CMPs are necessary to fully exploit the benefits of many-core technology, and they must consider the peculiarities of modern architectures, such as the availability of enhanced power-saving techniques and the presence of complex memory hierarchies. This thesis has several objectives. The first is to elaborate methods for efficient analytical modeling and architectural design-space exploration of CMPs. Efficiency is achieved by using analytical models instead of simulation and by replacing exhaustive exploration with an intelligent search strategy; these methods also incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5. The second objective is to propose a scalable algorithm for task mapping onto general-purpose CMPs with power-management techniques, for efficient deployment of many-core systems; this contribution is explained in Chapter 6. Finally, the third objective is to address on-chip interconnect design and exploration by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions; Chapter 7 describes the proposed model. The presented methods have been thoroughly tested experimentally, and the results are described in this dissertation. At the end of the document, several possible directions for future research are proposed.
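    As a toy illustration of analytical design-space exploration (the performance model, constants, and area budget below are invented for the example and are not the models developed in the thesis), one can sweep the core/cache area split and keep the best modeled configuration:

        # Hedged sketch: exhaustive sweep over a tiny design space using a
        # closed-form performance estimate instead of simulation.
        def throughput(n_cores, cache_mb, miss_penalty=200.0, base_cpi=1.0):
            miss_rate = 0.04 / (cache_mb ** 0.5)     # toy sqrt rule of thumb
            cpi = base_cpi + miss_rate * miss_penalty
            return n_cores ** 0.8 / cpi              # diminishing parallel returns

        def explore(area_budget=100.0, core_area=10.0, area_per_mb=5.0):
            best = None
            for n_cores in range(1, int(area_budget // core_area) + 1):
                cache_mb = (area_budget - n_cores * core_area) / area_per_mb
                if cache_mb <= 0:
                    continue
                cfg = (throughput(n_cores, cache_mb), n_cores, cache_mb)
                best = cfg if best is None else max(best, cfg)
            return best

        perf, cores, cache = explore()
        print(f"best: {cores} cores, {cache:.1f} MB cache, relative perf {perf:.2f}")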

    Cache designs for reliable hybrid high and ultra-low voltage operation

    Increasing demand for highly miniaturized, battery-powered, ultra-low-cost systems (e.g., below 1 USD) in emerging applications such as body, urban-life, and environment monitoring has introduced many challenges in chip design. Such applications require high performance occasionally, but very low energy consumption most of the time in order to extend battery lifetime, and they also require real-time guarantees. The most suitable technological solution for these devices consists of hybrid processors able to operate at (i) high voltage, to provide high performance, and (ii) near-/sub-threshold (NST) voltage, to provide ultra-low energy consumption. However, the most efficient SRAM memories for each voltage level differ, and different SRAM designs must be traded off, especially in cache memories, which occupy most of the processor's area. In this Thesis, we analyze the performance/power trade-offs involved in the design of SRAM L1 caches for reliable hybrid high- and NST-Vcc operation from a microarchitectural perspective. We develop new, simple, single-Vcc-domain hybrid cache architectures and data-management mechanisms that satisfy the stringent needs of our target market. The proposed solutions are shown to be highly energy-efficient with negligible impact on average performance, while maintaining the strong performance guarantees required by our target market.

    Simulation Of Multi-core Systems And Interconnections And Evaluation Of Fat-Mesh Networks

    Simulators are very important in computer-architecture research because they enable the exploration of new architectures and provide detailed performance evaluation without building costly physical hardware. Simulation is even more critical for studying future many-core architectures, as it provides the opportunity to assess computer systems that do not yet exist. In this thesis, a multiprocessor simulator is presented based on SESC, a cycle-accurate architecture simulator. The shared L2 cache system is extended into a distributed shared cache (DSC) with a directory-based cache-coherence protocol, and a mesh network module is extended and integrated into SESC to replace the bus for scalable inter-processor communication. On top of this extended multiprocessor simulation infrastructure, two interconnection enhancements are proposed and evaluated. First, a novel non-uniform fat-mesh network structure, similar in spirit to the fat-tree, is proposed. It exploits the average traffic pattern, typically all-to-all in a DSC, by dedicating additional links to connections with heavy traffic (e.g., near the center) and fewer links to lighter traffic (e.g., near the periphery). Two fat-mesh schemes are implemented based on different routing algorithms, analytical fat-mesh models are constructed from expressions for the traffic requirements of personalized all-to-all traffic, and simulation results demonstrate performance improvements over the uniform mesh. Second, a hybrid network consisting of one packet-switching plane and multiple circuit-switching planes is constructed; the circuit-switching planes provide fast paths between neighbors with heavy communication traffic, and a compiler technique that abstracts symbolic expressions of the benchmarks' communication patterns can be used to help facilitate circuit establishment.
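    The sketch below gives a hedged illustration of why central mesh links are the natural place for extra "fat" links: under dimension-order (XY) routing, which is assumed here and is not necessarily the routing used in the thesis, personalized all-to-all traffic loads the central links far more than the peripheral ones.

        # Hedged sketch: count how many source-destination flows of an
        # all-to-all pattern cross each link of a k x k mesh under XY routing.
        from collections import Counter

        def link_loads(k=4):
            loads = Counter()
            nodes = [(x, y) for x in range(k) for y in range(k)]
            for sx, sy in nodes:
                for dx, dy in nodes:
                    if (sx, sy) == (dx, dy):
                        continue
                    x, y = sx, sy
                    while x != dx:                      # route along X first
                        nx = x + (1 if dx > x else -1)
                        loads[((x, y), (nx, y))] += 1
                        x = nx
                    while y != dy:                      # then along Y
                        ny = y + (1 if dy > y else -1)
                        loads[((x, y), (x, ny))] += 1
                        y = ny
            return loads

        loads = link_loads()
        print("max link load:", max(loads.values()),
              "min link load:", min(loads.values()))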

    Support for Programming Models in Network-on-Chip-based Many-core Systems


    Handling Information and its Propagation to Engineer Complex Embedded Systems

    In today's data-driven technology, it is easy to assume that information is at the tip of our fingers, ready to be exploited. Research methodologies and tools are often built on top of this assumption. However, this illusion of abundance often breaks when attempting to transfer existing techniques to industrial applications. For instance, research has produced various methodologies to optimize the resource usage of large complex systems, such as the avionics of the Airbus A380. These approaches require knowledge of certain metrics such as execution time, memory consumption, communication delays, etc. The design of these complex systems, however, employs a mix of expertise from different fields (likely with limited knowledge in software engineering), which might lead to incomplete or missing specifications. Moreover, the unavailability of relevant information makes it difficult to properly describe the system, predict its behavior, and improve its performance. We fall back on probabilistic models and machine-learning techniques to address this lack of relevant information. Probability theory, especially, has great potential to describe partially observable systems. Our objective is to provide approaches and solutions that produce relevant information. This enables a proper description of complex systems to ease integration, and allows the use of existing optimization techniques. Our first step is to tackle one of the difficulties encountered during system integration: ensuring the proper timing behavior of critical systems. Due to technology scaling and the growing reliance on multi-core architectures, the overhead of software running on different cores and sharing memory space is no longer negligible. We therefore extend the real-time system toolkit with a static probabilistic timing analysis technique that accurately estimates the execution of software while accounting for shared-memory contention. The model is then incorporated into a simulator for scheduling multi-processor real-time systems.
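    A small sketch of the central operation behind static probabilistic timing analysis, convolving independent discrete distributions, follows; the distributions below are invented for illustration, and the actual analysis in the thesis is considerably more detailed.

        # Hedged sketch: distribution of (execution time + contention delay)
        # for two independent discrete random variables.
        from collections import defaultdict

        def convolve(pmf_a, pmf_b):
            """Probability mass function of the sum of two independent PMFs."""
            out = defaultdict(float)
            for ta, pa in pmf_a.items():
                for tb, pb in pmf_b.items():
                    out[ta + tb] += pa * pb
            return dict(out)

        exec_time  = {100: 0.70, 120: 0.25, 150: 0.05}   # cycles : probability
        contention = {0: 0.60, 10: 0.30, 30: 0.10}       # extra shared-memory cycles

        total = convolve(exec_time, contention)
        exceed = sum(p for t, p in total.items() if t > 160)
        print(f"P(response time > 160 cycles) = {exceed:.4f}")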