199 research outputs found

    Yield-Aware Leakage Power Reduction of On-Chip SRAMs

    Get PDF
    Leakage power dissipation of on-chip static random access memories (SRAMs) constitutes a significant fraction of the total chip power consumption in state-of-the-art microprocessors and system-on-chips (SoCs). Scaling the supply voltage of SRAMs during idle periods is a simple yet effective technique to reduce their leakage power consumption. However, supply voltage scaling also results in the degradation of the cells’ robustness, and thus reduces their capability to retain data reliably. This is particularly resulting in the failure of an increasing number of cells that are already weakened by excessive process parameters variations and/or manufacturing imperfections in nano-meter technologies. Thus, with technology scaling, it is becoming increasingly challenging to maintain the yield while attempting to reduce the leakage power of SRAMs. This research focuses on characterizing the yield-leakage tradeoffs and developing novel techniques for a yield-aware leakage power reduction of SRAMs. We first demonstrate that new fault behaviors emerge with the introduction of a low-leakage standby mode to SRAMs. In particular, it is shown that there are some types of defects in SRAM cells that start to cause failures only when the drowsy mode is activated. These defects are not sensitized in the active operating mode, and thus escape the traditional March tests. Fault models for these newly observed fault behaviors are developed and described in this thesis. Then, a new low-complexity test algorithm, called March RAD, is proposed that is capable of detecting all the drowsy faults as well as the simple traditional faults. Extreme process parameters variations can also result in SRAM cells with very weak data-retention capability. The probability of such cells may be very rare in small memory arrays, however, in large arrays, their probability is magnified by the huge number of bit-cells integrated on a single chip. Hence, it is critical also to account for such extremal events while attempting to scale the supply voltage of SRAMs. To estimate the statistics of such rare events within a reasonable computational time, we have employed concepts from extreme value theory (EVT). This has enabled us to accurately model the tail of the cell failure probability distribution versus the supply voltage. Analytical models are then developed to characterize the yield-leakage tradeoffs in large modern SRAMs. It is shown that even a moderate scaling of the supply voltage of large SRAMs can potentially result in significant yield losses, especially in processes with highly fluctuating parameters. Thus, we have investigated the application of fault-tolerance techniques for a more efficient leakage reduction of SRAMs. These techniques allow for a more aggressive voltage scaling by providing tolerance to the failures that might occur during the sleep mode. The results show that in a 45-nm technology, assuming 10% variation in transistors threshold voltage, repairing a 64KB memory using only 8 redundant rows or incorporating single error correcting codes (ECCs) allows for ~90% leakage reduction while incurring only ~1% yield loss. The combination of redundancy and ECC, however, allows to reach the practical limits of leakage reduction in the analyzed benchmark, i.e., ~95%. Applying an identical standby voltage to all dies, regardless of their specific process parameters variations, can result in too many cell failures in some dies with heavily skewed process parameters, so that they may no longer be salvageable by the employed fault-tolerance techniques. To compensate for the inter-die variations, we have proposed to tune the standby voltage of each individual die to its corresponding minimum level, after manufacturing. A test algorithm is presented that can be used to identify the minimum applicable standby voltage to each individual memory die. A possible implementation of the proposed tuning technique is also demonstrated. Simulation results in a 45-nm predictive technology show that tuning standby voltage of SRAMs can enhance data-retention yield by an additional 10%−50%, depending on the severity of the variations

    Reliable chip design from low powered unreliable components

    Get PDF
    The pace of technological improvement of the semiconductor market is driven by Moore’s Law, enabling chip transistor density to double every two years. The transistors would continue to decline in cost and size but increase in power. The continuous transistor scaling and extremely lower power constraints in modern Very Large Scale Integrated(VLSI) chips can potentially supersede the benefits of the technology shrinking due to reliability issues. As VLSI technology scales into nanoscale regime, fundamental physical limits are approached, and higher levels of variability, performance degradation, and higher rates of manufacturing defects are experienced. Soft errors, which traditionally affected only the memories, are now also resulting in logic circuit reliability degradation. A solution to these limitations is to integrate reliability assessment techniques into the Integrated Circuit(IC) design flow. This thesis investigates four aspects of reliability driven circuit design: a)Reliability estimation; b) Reliability optimization; c) Fault-tolerant techniques, and d) Delay degradation analysis. To guide the reliability driven synthesis and optimization of combinational circuits, highly accurate probability based reliability estimation methodology christened Conditional Probabilistic Error Propagation(CPEP) algorithm is developed to compute the impact of gate failures on the circuit output. CPEP guides the proposed rewriting based logic optimization algorithm employing local transformations. The main idea behind this methodology is to replace parts of the circuit with functionally equivalent but more reliable counterparts chosen from a precomputed subset of Negation-Permutation-Negation(NPN) classes of 4-variable functions. Cut enumeration and Boolean matching driven by reliability-aware optimization algorithm are used to identify the best possible replacement candidates. Experiments on a set of MCNC benchmark circuits and 8051 functional microcontroller units indicate that the proposed framework can achieve up to 75% reduction of output error probability. On average, about 14% SER reduction is obtained at the expense of very low area overhead of 6.57% that results in 13.52% higher power consumption. The next contribution of the research describes a novel methodology to design fault tolerant circuitry by employing the error correction codes known as Codeword Prediction Encoder(CPE). Traditional fault tolerant techniques analyze the circuit reliability issue from a static point of view neglecting the dynamic errors. In the context of communication and storage, the study of novel methods for reliable data transmission under unreliable hardware is an increasing priority. The idea of CPE is adapted from the field of forward error correction for telecommunications focusing on both encoding aspects and error correction capabilities. The proposed Augmented Encoding solution consists of computing an augmented codeword that contains both the codeword to be transmitted on the channel and extra parity bits. A Computer Aided Development(CAD) framework known as CPE simulator is developed providing a unified platform that comprises a novel encoder and fault tolerant LDPC decoders. Experiments on a set of encoders with different coding rates and different decoders indicate that the proposed framework can correct all errors under specific scenarios. On average, about 1000 times improvement in Soft Error Rate(SER) reduction is achieved. Last part of the research is the Inverse Gaussian Distribution(IGD) based delay model applicable to both combinational and sequential elements for sub-powered circuits. The Probability Density Function(PDF) based delay model accurately captures the delay behavior of all the basic gates in the library database. The IGD model employs these necessary parameters, and the delay estimation accuracy is demonstrated by evaluating multiple circuits. Experiments results indicate that the IGD based approach provides a high matching against HSPICE Monte Carlo simulation results, with an average error less than 1.9% and 1.2% for the 8-bit Ripple Carry Adder(RCA), and 8-bit De-Multiplexer(DEMUX) and Multiplexer(MUX) respectively

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Cross layer reliability estimation for digital systems

    Get PDF
    Forthcoming manufacturing technologies hold the promise to increase multifuctional computing systems performance and functionality thanks to a remarkable growth of the device integration density. Despite the benefits introduced by this technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor size reaching the atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and environmental stress rise dramatically. Failing to meet a reliability requirement may add excessive re-design cost to recover and may have severe consequences on the success of a product. %Worst-case design with large margins to guarantee reliable operation has been employed for long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building ``dependable'' systems on top of unreliable components, which will degrade and even fail during normal lifetime of the chip. Conventional design techniques are highly inefficient. They expend significant amount of energy to tolerate the device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional cost introduced to compensate unreliability are rapidly becoming unacceptable in today's environment where power consumption is often the limiting factor for integrated circuit performance, and energy efficiency is a top concern. Attention should be payed to tailor techniques to improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions favoring the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanism are carefully implemented at different layers starting from the technology up to the software layer to carefully optimize the system by exploiting the inner capability of each layer to mask lower level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains

    High-Density Solid-State Memory Devices and Technologies

    Get PDF
    This Special Issue aims to examine high-density solid-state memory devices and technologies from various standpoints in an attempt to foster their continuous success in the future. Considering that broadening of the range of applications will likely offer different types of solid-state memories their chance in the spotlight, the Special Issue is not focused on a specific storage solution but rather embraces all the most relevant solid-state memory devices and technologies currently on stage. Even the subjects dealt with in this Special Issue are widespread, ranging from process and design issues/innovations to the experimental and theoretical analysis of the operation and from the performance and reliability of memory devices and arrays to the exploitation of solid-state memories to pursue new computing paradigms

    The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study

    Get PDF
    As DRAM cells continue to shrink, they become more susceptible to retention failures. DRAM cells that permanently exhibit short retention times are fairly easy to identify and repair through the use of memory tests and row and column redundancy. However, the retention time of many cells may vary over time due to a property called Variable Retention Time (VRT). Since these cells intermittently transition between failing and non-failing states, they are particularly difficult to identify through memory tests alone. In addition, the high temperature packaging process may aggravate this problem as the susceptibility of cells to VRT increases after the assembly of DRAM chips. A promising alternative to manufacturetime testing is to detect and mitigate retention failures after the system has become operational. Such a system would require mechanisms to detect and mitigate retention failures in the field, but woul

    Cross-layer Soft Error Analysis and Mitigation at Nanoscale Technologies

    Get PDF
    This thesis addresses the challenge of soft error modeling and mitigation in nansoscale technology nodes and pushes the state-of-the-art forward by proposing novel modeling, analyze and mitigation techniques. The proposed soft error sensitivity analysis platform accurately models both error generation and propagation starting from a technology dependent device level simulations all the way to workload dependent application level analysis

    Adaptive Intelligent Systems for Extreme Environments

    Get PDF
    As embedded processors become powerful, a growing number of embedded systems equipped with artificial intelligence (AI) algorithms have been used in radiation environments to perform routine tasks to reduce radiation risk for human workers. On the one hand, because of the low price, commercial-off-the-shelf devices and components are becoming increasingly popular to make such tasks more affordable. Meanwhile, it also presents new challenges to improve radiation tolerance, the capability to conduct multiple AI tasks and deliver the power efficiency of the embedded systems in harsh environments. There are three aspects of research work that have been completed in this thesis: 1) a fast simulation method for analysis of single event effect (SEE) in integrated circuits, 2) a self-refresh scheme to detect and correct bit-flips in random access memory (RAM), and 3) a hardware AI system with dynamic hardware accelerators and AI models for increasing flexibility and efficiency. The variances of the physical parameters in practical implementation, such as the nature of the particle, linear energy transfer and circuit characteristics, may have a large impact on the final simulation accuracy, which will significantly increase the complexity and cost in the workflow of the transistor level simulation for large-scale circuits. It makes it difficult to conduct SEE simulations for large-scale circuits. Therefore, in the first research work, a new SEE simulation scheme is proposed, to offer a fast and cost-efficient method to evaluate and compare the performance of large-scale circuits which subject to the effects of radiation particles. The advantages of transistor and hardware description language (HDL) simulations are combined here to produce accurate SEE digital error models for rapid error analysis in large-scale circuits. Under the proposed scheme, time-consuming back-end steps are skipped. The SEE analysis for large-scale circuits can be completed in just few hours. In high-radiation environments, bit-flips in RAMs can not only occur but may also be accumulated. However, the typical error mitigation methods can not handle high error rates with low hardware costs. In the second work, an adaptive scheme combined with correcting codes and refreshing techniques is proposed, to correct errors and mitigate error accumulation in extreme radiation environments. This scheme is proposed to continuously refresh the data in RAMs so that errors can not be accumulated. Furthermore, because the proposed design can share the same ports with the user module without changing the timing sequence, it thus can be easily applied to the system where the hardware modules are designed with fixed reading and writing latency. It is a challenge to implement intelligent systems with constrained hardware resources. In the third work, an adaptive hardware resource management system for multiple AI tasks in harsh environments was designed. Inspired by the “refreshing” concept in the second work, we utilise a key feature of FPGAs, partial reconfiguration, to improve the reliability and efficiency of the AI system. More importantly, this feature provides the capability to manage the hardware resources for deep learning acceleration. In the proposed design, the on-chip hardware resources are dynamically managed to improve the flexibility, performance and power efficiency of deep learning inference systems. The deep learning units provided by Xilinx are used to perform multiple AI tasks simultaneously, and the experiments show significant improvements in power efficiency for a wide range of scenarios with different workloads. To further improve the performance of the system, the concept of reconfiguration was further extended. As a result, an adaptive DL software framework was designed. This framework can provide a significant level of adaptability support for various deep learning algorithms on an FPGA-based edge computing platform. To meet the specific accuracy and latency requirements derived from the running applications and operating environments, the platform may dynamically update hardware and software (e.g., processing pipelines) to achieve better cost, power, and processing efficiency compared to the static system

    Cost-Efficient Soft-Error Resiliency for ASIP-based Embedded Systems

    Full text link
    Recent decades have witnessed the rapid growth of embedded systems. At present, embedded systems are widely applied in a broad range of critical applications including automotive electronics, telecommunication, healthcare, industrial electronics, consumer electronics military and aerospace. Human society will continue to be greatly transformed by the pervasive deployment of embedded systems. Consequently, substantial amount of efforts from both industry and academic communities have contributed to the research and development of embedded systems. Application-specific instruction-set processor (ASIP) is one of the key advances in embedded processor technology, and a crucial component in some embedded systems. Soft errors have been directly observed since the 1970s. As devices scale, the exponential increase in the integration of computing systems occurs, which leads to correspondingly decrease in the reliability of computing systems. Today, major research forums state that soft errors are one of the major design technology challenges at and beyond the 22 nm technology node. Therefore, a large number of soft-error solutions, including error detection and recovery, have been proposed from differing perspectives. Nonetheless, most of the existing solutions are designed for general or high-performance systems which are different to embedded systems. For embedded systems, the soft-error solutions must be cost-efficient, which requires the tailoring of the processor architecture with respect to the feature of the target application. This thesis embodies a series of explorations for cost-efficient soft-error solutions for ASIP-based embedded systems. In this exploration, five major solutions are proposed. The first proposed solution realizes checkpoint recovery in ASIPs. By generating customized instructions, ASIP-implemented checkpoint recovery can perform at a finer granularity than what was previously possible. The fault-free performance overhead of this solution is only 1.45% on average. The recovery delay is only 62 cycles at the worst case. The area and leakage power overheads are 44.4% and 45.6% on average. The second solution explores utilizing two primitive error recovery techniques jointly. This solution includes three application-specific optimization methodologies. This solution generates the optimized error-resilient ASIPs, based on the characteristics of primitive error recovery techniques, static reliability analysis and design constraints. The resultant ASIP can be configured to perform at runtime according to the optimized recovery scheme. This solution can strategically enhance cost-efficiency for error recovery. In order to guarantee cost-efficiency in unpredictable runtime situations, the third solution explores runtime adaptation for error recovery. This solution aims to budget and adapt the error recovery operations, so as to spend the resources intelligently and to tolerate adverse influences of runtime variations. The resultant ASIP can make runtime decisions to determine the activation of spatial and temporal redundancies, according to the runtime situations. At the best case, this solution can achieve almost 50x reliability gain over the state of the art solutions. Given the increasing demand for multi-core computing systems, the last two proposed solutions target error recovery in multi-core ASIPs. The first solution of these two explores ASIP-implemented fine-grained process migration. This solution is a key infrastructure, which allows cost-efficient task management, for realizing cost-efficient soft-error recovery in multi-core ASIPs. The average time cost is only 289 machine cycles to perform process migration. The last solution explores using dynamic and adaptive mapping to assign heterogeneous recovery operations to the tasks in the multi-core context. This solution allows each individual ASIP-based processing core to dynamically adapt its specific error recovery functionality according to the corresponding task's characteristics, in terms of soft error vulnerability and execution time deadline. This solution can significantly improve the reliability of the system by almost two times, with graceful constraint penalty, in comparison to the state-of-the-art counterparts

    The 1992 4th NASA SERC Symposium on VLSI Design

    Get PDF
    Papers from the fourth annual NASA Symposium on VLSI Design, co-sponsored by the IEEE, are presented. Each year this symposium is organized by the NASA Space Engineering Research Center (SERC) at the University of Idaho and is held in conjunction with a quarterly meeting of the NASA Data System Technology Working Group (DSTWG). One task of the DSTWG is to develop new electronic technologies that will meet next generation electronic data system needs. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The NASA SERC is proud to offer, at its fourth symposium on VLSI design, presentations by an outstanding set of individuals from national laboratories, the electronics industry, and universities. These speakers share insights into next generation advances that will serve as a basis for future VLSI design
    corecore