
    Leveraging data compression for performance-efficient and long-lasting NVM-based last-level cache

    Non-volatile memory (NVM) technologies are interesting alternatives for building on-chip Last-Level Caches (LLCs). Compared to SRAM, they offer higher density and lower static power, but each write operation slightly wears out the bitcell, to the point of losing its storage capacity. In this context, this paper summarizes three contributions to the state of the art in NVM-based LLCs. Data compression reduces the size of the blocks and, together with wear-leveling mechanisms, can defer the wear-out of NVMs. Moreover, as capacity is reduced by write wear, data compression enables degraded cache frames to allocate blocks whose compressed size fits the remaining capacity. Our first contribution is a microarchitecture design that leverages data compression and intra-frame wear leveling to deal gracefully with NVM-LLC capacity degradation. The second contribution builds on this microarchitecture to propose new insertion policies for hybrid LLCs that use Set Dueling and take into account the compression capabilities of the blocks. From a methodological point of view, although different approaches are used in the literature to analyze the degradation of an NVM-LLC, none of them allows its temporal evolution to be studied in detail. In this sense, the third contribution is a forecasting procedure that combines detailed simulation and prediction, enabling an accurate analysis of the effect of different cache content mechanisms (replacement, wear leveling, compression, etc.) on the temporal evolution of the performance of multiprocessor systems employing such NVM-LLCs. Using this forecasting procedure, we show that the proposed NVM-LLC organizations and the insertion policies for hybrid LLCs significantly outperform the state of the art in both performance and lifetime metrics.
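    As a rough illustration of the idea of placing compressed blocks into wear-degraded frames (this is a minimal sketch under my own assumptions, not the paper's microarchitecture; names such as Frame, compressed_size, and choose_frame are illustrative), a frame could track how many healthy bytes it retains and accept only blocks whose compressed size fits:

```python
# Minimal sketch (not the paper's actual design): choose a victim frame in a set
# so that a block's compressed size fits the frame's remaining healthy capacity.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Frame:
    healthy_bytes: int   # capacity left after wear-induced cell failures (assumed tracked per frame)
    lru_age: int         # higher = older (assumed LRU replacement)
    valid: bool = False

def compressed_size(block: bytes) -> int:
    """Toy compressor: null bytes are assumed to compress away entirely."""
    return max(1, sum(1 for b in block if b != 0))

def choose_frame(set_frames: List[Frame], block: bytes) -> Optional[Frame]:
    """Pick a frame whose remaining healthy capacity can hold the compressed
    block: an invalid one if possible, otherwise the least-recently used one.
    Return None (bypass the LLC) if no frame in the set fits."""
    size = compressed_size(block)
    candidates = [f for f in set_frames if f.healthy_bytes >= size]
    if not candidates:
        return None
    invalid = [f for f in candidates if not f.valid]
    if invalid:
        return invalid[0]
    return max(candidates, key=lambda f: f.lru_age)

frames = [Frame(64, 0, True), Frame(24, 3, True), Frame(8, 1, False)]
blk = bytes(50) + b"payload-data--"      # mostly null: compresses to 14 bytes
print(choose_frame(frames, blk).healthy_bytes)   # 24: a degraded frame is still usable
```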

    Cache Sharing Administration for Performance Fairness using D3C Miss Classification in Chip Multi-Processors

    This work presents a study of fairness in cache sharing between processes in a chip multiprocessor (CMP). We propose a new algorithm that uses a metric based on the D3C miss classification and the LRU stack distance to measure fairness in the administration of the shared resources, with the goal of increasing the global IPC of all executed processes. Shared-cache miss rate, IPC, and bandwidth metrics were considered to analyze the simulation results obtained with three test sets. The results show that, compared to the Capitalist management policy, the proposed dynamic management policy has a lower global miss rate in the shared cache and lower bandwidth usage for each test set studied, and fulfills its objective of managing the shared cache space for every process while improving the overall IPC. Sociedad Argentina de Informática e Investigación Operativa.
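    The metric above builds on LRU stack distances. As a hedged illustration only (this is not the paper's D3C-based algorithm; stack_distances and estimated_misses are illustrative names), the sketch below computes per-process stack-distance histograms from an access trace and estimates the misses each process would suffer if granted a fixed number of ways:

```python
# Illustrative only: per-process LRU stack distances and a way-allocation miss
# estimate; this does not implement the D3C miss classification from the paper.

from collections import defaultdict

def stack_distances(trace):
    """trace: list of (process_id, block_address). Returns {pid: [distances]},
    where distance is the LRU stack depth of each reuse (infinite on first use)."""
    stacks = defaultdict(list)            # per-process LRU stacks (MRU at the end)
    dists = defaultdict(list)
    for pid, addr in trace:
        stack = stacks[pid]
        if addr in stack:
            depth = len(stack) - stack.index(addr) - 1
            stack.remove(addr)
        else:
            depth = float("inf")          # cold miss
        stack.append(addr)
        dists[pid].append(depth)
    return dists

def estimated_misses(dists, ways):
    """An access misses if its stack distance is >= the number of ways granted."""
    return {pid: sum(1 for d in ds if d >= ways) for pid, ds in dists.items()}

trace = [(0, 0xA), (1, 0xB), (0, 0xC), (0, 0xA), (1, 0xB), (0, 0xC)]
print(estimated_misses(stack_distances(trace), ways=2))   # {0: 2, 1: 1}
```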

    Analysis of Multi-Core Central Processor Technology

    The work comprises 30 pages, including 9 figures and a bibliography of 19 sources. The object of study of this qualification work is the analysis of multi-core central processor technology. The goal of the work is to analyze the technologies used in the design and construction of multi-core central processors. In the course of the work, the types, modifications, advantages, and disadvantages of multi-core processor technology were analyzed, and the leading companies in multi-core processor development were identified, along with their distinctive features and the principles of their interaction with the system. The analysis established that the multi-core processor is an important addition to the microprocessor timeline and a means of addressing problems of high power consumption, cost, and size. The architectures of Intel and AMD, the two most important vendors of modern multi-core systems, are highlighted, and a comparison is made from an architectural perspective.

    Interconnect and Memory Design for Intelligent Mobile System

    Technology scaling has driven transistors to smaller area, higher performance, and lower power consumption, leading us into the mobile and edge computing era. However, the benefits of technology scaling are diminishing today, as wire delay and energy scale far more slowly than those of logic, which makes communication more expensive than computation. Moreover, emerging data-centric algorithms like deep learning place a growing demand on SRAM capacity and bandwidth. High access energy and the large leakage of big on-chip SRAMs have become the main limiters to realizing an energy-efficient, low-power smart sensor platform. This thesis presents several architecture and circuit solutions to enable intelligent mobile systems, including a voltage-scalable interconnect scheme, Compute-In-Memory (CIM), a low-power memory system for an edge deep-learning processor, and an ultra-low-leakage stacked-voltage-domain SRAM for a low-power smart image signal processor (ISP). Four prototypes are implemented for demonstration and verification. The first two seek solutions to slow and energy-hungry global on-chip interconnect: the first prototype proposes a reconfigurable self-timed-regenerator-based global interconnect scheme to achieve higher performance and energy efficiency over a wide voltage range, while the second presents a non-von Neumann architecture, a hybrid in-/near-memory Compute SRAM (CRAM), to address the locality issue. The next two works focus on low-power, low-leakage SRAM design for intelligent sensors. The third prototype is a low-power memory design for a deep learning processor with 270 KB of custom SRAM and a Non-Uniform Memory Access architecture. The fourth prototype is an ultra-low-leakage SRAM for a motion-triggered low-power smart image sensor system with voltage-domain stacking and a novel array-swapping mechanism. The work presented in this dissertation exploits optimizations at both the architecture level (exploiting temporal and spatial locality) and through circuit customization to overcome the main challenges in building extremely energy-efficient, battery-powered intelligent mobile devices. The impact of this work is significant in the era of the Internet of Things (IoT) and the age of AI, when mobile computing systems become ubiquitous, intelligent, and longer-lived on battery, powered by the proposed solutions. PhD thesis, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/155232/1/jiwang_1.pd

    A fault-tolerant last level cache for CMPs operating at ultra-low voltage

    Voltage scaling to values near the threshold voltage is a promising technique to hold off the many-core power wall. However, as voltage decreases, some SRAM cells are unable to operate reliably and show behavior consistent with a hard fault. Block disabling is a microarchitectural technique that allows low-voltage operation by deactivating faulty cache entries, at the expense of reducing the effective cache capacity. In the case of the last-level cache, this capacity reduction leads to an increase in off-chip memory accesses, diminishing the overall energy benefit of reducing the supply voltage. In this work, we exploit the reuse locality and the intrinsic redundancy of multi-level inclusive hierarchies to enhance the performance of block disabling at negligible cost. The proposed fault-aware last-level cache management policy maps critical blocks, those not present in private caches and with a higher probability of being reused, to active cache entries. Our evaluation shows that this fault-aware management results in up to 37.3% and 54.2% fewer misses per kilo-instruction (MPKI) than block disabling for multiprogrammed and parallel workloads, respectively. This translates into performance improvements of up to 13% and 34.6% for multiprogrammed and parallel workloads, respectively.
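    To make the placement idea concrete, here is a rough, hedged sketch (my own simplification, not the paper's exact policy; pick_victim and the entry fields are illustrative): blocks absent from the private caches are the ones whose eviction would cost an off-chip access, so they get priority for the working entries of a set, while blocks still backed by an on-chip private copy are preferred victims.

```python
# Rough sketch of a fault-aware LLC victim choice in an inclusive hierarchy
# (illustrative only).

def pick_victim(set_entries, in_private_cache):
    """set_entries: list of dicts with keys 'faulty', 'valid', 'tag', 'reused'.
    in_private_cache: predicate(tag) -> bool, e.g. from the coherence directory.
    Returns the entry to overwrite, or None to bypass the insertion."""
    working = [e for e in set_entries if not e["faulty"]]
    if not working:
        return None                                  # whole set disabled: bypass
    # Prefer invalid entries, then blocks still backed by a private cache copy
    # (they can be recovered on-chip), then the least-reused block.
    for entry in working:
        if not entry["valid"]:
            return entry
    backed = [e for e in working if in_private_cache(e["tag"])]
    if backed:
        return min(backed, key=lambda e: e["reused"])
    return min(working, key=lambda e: e["reused"])

entries = [
    {"faulty": True,  "valid": True, "tag": 0x1, "reused": 5},
    {"faulty": False, "valid": True, "tag": 0x2, "reused": 0},
    {"faulty": False, "valid": True, "tag": 0x3, "reused": 7},
]
print(pick_victim(entries, in_private_cache=lambda tag: tag == 0x2)["tag"])  # 2: still backed on-chip
```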

    Concertina: Squeezing in cache content to operate at near-threshold voltage

    Scaling the supply voltage to values near the threshold voltage allows a dramatic decrease in the power consumption of processors; however, the lower the voltage, the higher the sensitivity to process variation and, hence, the lower the reliability. Large SRAM structures, like the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high density requirements. In this paper, we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks so that they can be allocated to cache entries with faulty cells, enabling use of 100 percent of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. Concertina reaches the performance of an ideal system implementing an LLC that does not suffer from parameter variation, with a modest storage overhead. Specifically, performance degrades by less than 2 percent even when using small SRAM cells, which implies that over 90 percent of cache entries have defective cells; this represents a notable improvement over previously proposed techniques.
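    The core fit test can be illustrated with a short, hedged sketch (details such as the 8-byte subblock granularity are assumptions of mine, not taken from the paper): a block fits a partially faulty entry if its non-null subblocks can be packed into the entry's functional subentries.

```python
# Illustrative sketch of null-subblock compression and a fault-aware fit check.

SUBBLOCK = 8   # bytes per subblock (assumed granularity)

def nonnull_subblocks(block: bytes) -> int:
    """Count subblocks that contain at least one non-zero byte."""
    return sum(
        any(block[i:i + SUBBLOCK])
        for i in range(0, len(block), SUBBLOCK)
    )

def fits(block: bytes, faulty_subentries: int, entry_subentries: int = 8) -> bool:
    """True if the compressed block fits the entry's working subentries."""
    return nonnull_subblocks(block) <= entry_subentries - faulty_subentries

block = bytes(48) + bytes(range(16))      # 48 null bytes followed by 16 mostly non-null bytes
print(fits(block, faulty_subentries=5))   # 2 non-null subblocks <= 3 working subentries -> True
```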

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Ever-decreasing device size causes more frequent hard faults, which become a serious burden for processor design and yield management. This problem is particularly pronounced in on-chip memory, which consumes up to 70% of a processor's total chip area. Traditional circuit-level techniques, such as redundancy and error-correcting codes, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution to building reliable on-chip memory in the future processor environment. Our approach has two parts: a design framework and architectural techniques for on-chip memory structures. The design framework provides important architectural evaluation metrics such as yield, area, and performance based on low-level defect and process variation parameters, so processor architects can quickly evaluate their designs in terms of yield, area, and performance. With this framework, we develop architectural yield-enhancement solutions for on-chip memory structures including the L1 cache, the L2 cache, and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which shows that there will be many more functional L2 caches than compute cores on a chip. We propose efficient utilization techniques for these excess caches. Evaluation results show that excess caches significantly improve the overall performance of CMPs.
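    A toy version of such a yield estimate, sketched below under my own assumptions (it is not the framework described above), goes from a per-cell fault probability to a row fault probability and then to the yield of an array repaired with a handful of spare rows:

```python
# Hedged sketch of a memory-array yield estimate with row redundancy:
# binomial count of faulty rows, repairable as long as it does not exceed the spares.

from math import comb

def row_fault_prob(p_cell: float, cells_per_row: int) -> float:
    """Probability that a row contains at least one faulty cell."""
    return 1.0 - (1.0 - p_cell) ** cells_per_row

def array_yield(p_cell: float, rows: int, cells_per_row: int, spare_rows: int) -> float:
    """Probability that at most `spare_rows` rows are faulty (i.e., all repairable)."""
    p_row = row_fault_prob(p_cell, cells_per_row)
    return sum(
        comb(rows, k) * p_row**k * (1.0 - p_row) ** (rows - k)
        for k in range(spare_rows + 1)
    )

# Example numbers (assumed): 16384 rows of 512 cells, per-cell fault probability 1e-7.
print(array_yield(1e-7, rows=16384, cells_per_row=512, spare_rows=8))
```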

    Content management in caches operating at low voltage

    The energy efficiency of on-chip caches can be improved by reducing their supply voltage (Vdd). However, this Vdd scaling is limited to a voltage Vddmin below which some SRAM (Static Random Access Memory) cells may not operate reliably. Block disabling (BD) is a microarchitectural technique that enables operation at very low voltages by deactivating those entries that contain any cell that does not operate reliably, at the cost of reducing the effective capacity of the cache. It is used in last-level caches (LLCs), where the potential savings are greatest. However, for some applications, the increase in energy consumption due to off-chip memory accesses does not compensate for the savings achieved in the LLC. This work takes advantage of resources already present in multiprocessors, such as the on-chip memory hierarchy and the coherence mechanism, to improve the performance of BD. Specifically, we propose exploiting the natural redundancy of an inclusive cache hierarchy to mitigate the performance loss caused by the reduction in LLC capacity. We also propose a new content management policy aware of the existence of defective cache entries: using reuse information, the replacement algorithm assigns operative cache entries to the blocks most likely to be referenced. The proposed techniques reduce the MPKI by up to 36.4% with respect to block disabling, improving its performance by between 2% and 13%.

    Yield-Aware Leakage Power Reduction of On-Chip SRAMs

    Leakage power dissipation of on-chip static random access memories (SRAMs) constitutes a significant fraction of the total chip power consumption in state-of-the-art microprocessors and systems-on-chip (SoCs). Scaling the supply voltage of SRAMs during idle periods is a simple yet effective technique to reduce their leakage power consumption. However, supply voltage scaling also degrades the cells' robustness and thus reduces their capability to retain data reliably. In particular, it causes the failure of an increasing number of cells that are already weakened by excessive process parameter variations and/or manufacturing imperfections in nanometer technologies. Thus, with technology scaling, it is becoming increasingly challenging to maintain yield while attempting to reduce the leakage power of SRAMs. This research focuses on characterizing the yield-leakage tradeoffs and developing novel techniques for yield-aware leakage power reduction of SRAMs. We first demonstrate that new fault behaviors emerge with the introduction of a low-leakage standby mode to SRAMs. In particular, it is shown that some types of defects in SRAM cells start to cause failures only when the drowsy mode is activated. These defects are not sensitized in the active operating mode, and thus escape traditional March tests. Fault models for these newly observed fault behaviors are developed and described in this thesis. Then, a new low-complexity test algorithm, called March RAD, is proposed that is capable of detecting all the drowsy faults as well as the simple traditional faults. Extreme process parameter variations can also result in SRAM cells with very weak data-retention capability. Such cells may be very rare in small memory arrays, but in large arrays their probability is magnified by the huge number of bitcells integrated on a single chip. Hence, it is critical to also account for such extreme events while attempting to scale the supply voltage of SRAMs. To estimate the statistics of such rare events within a reasonable computational time, we employ concepts from extreme value theory (EVT), which enables us to accurately model the tail of the cell failure probability distribution versus the supply voltage. Analytical models are then developed to characterize the yield-leakage tradeoffs in large modern SRAMs. It is shown that even a moderate scaling of the supply voltage of large SRAMs can potentially result in significant yield losses, especially in processes with highly fluctuating parameters. Thus, we investigate the application of fault-tolerance techniques for more efficient leakage reduction in SRAMs. These techniques allow for more aggressive voltage scaling by providing tolerance to the failures that might occur during the sleep mode. The results show that in a 45-nm technology, assuming 10% variation in transistor threshold voltage, repairing a 64 KB memory using only 8 redundant rows or incorporating single-error-correcting codes (ECC) allows for ~90% leakage reduction while incurring only ~1% yield loss. The combination of redundancy and ECC, however, makes it possible to reach the practical limits of leakage reduction in the analyzed benchmark, i.e., ~95%.
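    The EVT idea can be illustrated with a hedged sketch (this is my own toy model, not the thesis's analysis; the Gaussian data-retention-voltage spread and the 0.999 tail threshold are assumptions): fit a generalized Pareto distribution to the upper tail of simulated per-cell data-retention voltages and extrapolate the probability that a cell fails at a given standby voltage.

```python
# Illustrative EVT tail fit for per-cell data-retention voltage (DRV).

import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)
drv = rng.normal(loc=0.30, scale=0.03, size=1_000_000)   # assumed DRV spread, in volts

threshold = np.quantile(drv, 0.999)            # model only the upper tail
excesses = drv[drv > threshold] - threshold
shape, _, scale = genpareto.fit(excesses, floc=0.0)

def cell_fail_prob(v_standby: float) -> float:
    """P(DRV > v_standby): a cell loses its data if the standby voltage is below its DRV."""
    if v_standby <= threshold:
        return float(np.mean(drv > v_standby))            # empirical estimate in the bulk
    tail = genpareto.sf(v_standby - threshold, shape, loc=0.0, scale=scale)
    return 1e-3 * float(tail)                             # P(exceed threshold) ~ 0.001 by construction

print(cell_fail_prob(0.45))   # extrapolated far-tail failure probability
```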
Applying an identical standby voltage to all dies, regardless of their specific process parameter variations, can result in too many cell failures in some dies with heavily skewed process parameters, to the point that they may no longer be salvageable by the employed fault-tolerance techniques. To compensate for inter-die variations, we propose tuning the standby voltage of each individual die to its corresponding minimum level after manufacturing. A test algorithm is presented that can be used to identify the minimum applicable standby voltage for each individual memory die, and a possible implementation of the proposed tuning technique is demonstrated. Simulation results in a 45-nm predictive technology show that tuning the standby voltage of SRAMs can enhance data-retention yield by an additional 10%-50%, depending on the severity of the variations.
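    As a toy illustration of per-die tuning (the thesis's actual test algorithm is not reproduced here; min_standby_voltage, the voltage bounds, and the guard band are assumptions), one could binary-search the lowest voltage at which a write/standby/read retention test still passes and then add a guard band:

```python
# Toy sketch of finding a die's minimum applicable standby voltage.

def min_standby_voltage(retention_test_passes, v_lo=0.2, v_hi=1.0,
                        step=0.005, guard_band=0.02):
    """retention_test_passes(v) -> bool: write a pattern, enter standby at voltage v,
    wake up, and check that the pattern (after repair/ECC) is intact."""
    while v_hi - v_lo > step:
        v_mid = 0.5 * (v_lo + v_hi)
        if retention_test_passes(v_mid):
            v_hi = v_mid          # still safe: try lower
        else:
            v_lo = v_mid          # failures appear: back off
    return v_hi + guard_band

# Example with a fake per-die limit standing in for silicon measurements.
die_limit = 0.41
print(min_standby_voltage(lambda v: v >= die_limit))   # roughly 0.43 V
```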

    Characterization and mitigation of process variation in digital circuits and systems

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 155-166). Process variation threatens to negate a whole generation of scaling in advanced process technologies due to performance and power spreads of greater than 30-50%. Mitigating this impact requires a thorough understanding of the variation sources, magnitudes, and spatial components at the device, circuit, and architectural levels. This thesis explores the impact of variation at each of these levels and evaluates techniques to alleviate it in the context of digital circuits and systems. At the device level, we propose isolation and measurement of variation in the intrinsic threshold voltage of a MOSFET using sub-threshold leakage currents. Analysis of the measured data, from a test chip implemented in a 0.18 µm CMOS process, indicates that variation in MOSFET threshold voltage is a truly random process dependent only on device dimensions. Further decomposition of the observed variation reveals no systematic within-die variation components nor any spatial correlation. A second test chip, capable of characterizing spatial variation in digital circuits, is developed and implemented in a 90 nm triple-well CMOS process. Measured variation results show that the within-die component of variation is small at high voltages but becomes an increasing fraction of the total variation as the power-supply voltage decreases. Once again, the data show no evidence of within-die spatial correlation and only weak systematic components. Evaluation of adaptive body-biasing and voltage scaling as variation mitigation techniques shows that voltage scaling is more effective at modifying performance, with less impact on idle power, than body-biasing. Finally, the addition of power-supply voltages in a massively parallel multicore processor is explored to reduce the energy required to cope with process variation. An analytic optimization framework is developed and analyzed; using a custom simulation methodology, the total energy of a hypothetical 1K-core processor based on the RAW core is reduced by 6-16% with the addition of only a single voltage. Analysis of yield versus required energy demonstrates that a combination of disabling poor-performing cores and additional power-supply voltages results in an optimal trade-off between performance and energy. By Nigel Anthony Drego. Ph.D.
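    A rough sketch of the kind of decomposition discussed above, under my own assumptions (a toy die-mean subtraction, not the thesis methodology; decompose_variation and the synthetic sigmas are illustrative), separates die-to-die from within-die threshold-voltage spread:

```python
# Toy decomposition of measured threshold voltages into die-to-die (D2D) and
# within-die (WID) components by subtracting each die's mean.

import numpy as np

def decompose_variation(vth):
    """vth: array of shape (num_dies, devices_per_die) of threshold voltages.
    Returns (sigma_die_to_die, sigma_within_die)."""
    die_means = vth.mean(axis=1, keepdims=True)
    sigma_d2d = vth.mean(axis=1).std()            # spread of per-die means
    sigma_wid = (vth - die_means).std()           # residual spread within dies
    return sigma_d2d, sigma_wid

# Synthetic example: 50 dies, 1000 devices each, 20 mV D2D and 30 mV WID sigma.
rng = np.random.default_rng(1)
vth = 0.45 + rng.normal(0, 0.020, size=(50, 1)) + rng.normal(0, 0.030, size=(50, 1000))
print(decompose_variation(vth))   # roughly (0.020, 0.030)
```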