193 research outputs found

    Affordable techniques for dependable microprocessor design

    Get PDF
    As high computing power is available at an affordable cost, we rely on microprocessor-based systems for much greater variety of applications. This dependence indicates that a processor failure could have more diverse impacts on our daily lives. Therefore, dependability is becoming an increasingly important quality measure of microprocessors.;Temporary hardware malfunctions caused by unstable environmental conditions can lead the processor to an incorrect state. This is referred to as a transient error or soft error. Studies have shown that soft errors are the major source of system failures. This dissertation characterizes the soft error behavior on microprocessors and presents new microarchitectural approaches that can realize high dependability with low overhead.;Our fault injection studies using RISC processors have demonstrated that different functional blocks of the processor have distinct susceptibilities to soft errors. The error susceptibility information must be reflected in devising fault tolerance schemes for cost-sensitive applications. Considering the common use of on-chip caches in modern processors, we investigated area-efficient protection schemes for memory arrays. The idea of caching redundant information was exploited to optimize resource utilization for increased dependability. We also developed a mechanism to verify the integrity of data transfer from lower level memories to the primary caches. The results of this study show that by exploiting bus idle cycles and the information redundancy, an almost complete check for the initial memory data transfer is possible without incurring a performance penalty.;For protecting the processor\u27s control logic, which usually remains unprotected, we propose a low-cost reliability enhancement strategy. We classified control logic signals into static and dynamic control depending on their changeability, and applied various techniques including commit-time checking, signature caching, component-level duplication, and control flow monitoring. Our schemes can achieve more than 99% coverage with a very small hardware addition. Finally, a virtual duplex architecture for superscalar processors is presented. In this system-level approach, the processor pipeline is backed up by a partially replicated pipeline. The replication-based checker minimizes the design and verification overheads. For a large-scale superscalar processor, the proposed architecture can bring 61.4% reduction in die area while sustaining the maximum performance

    Soft-error resilient on-chip memory structures

    Get PDF
    Soft errors induced by energetic particle strikes in on-chip memory structures, such as L1 data/instruction caches and register files, have become an increasing challenge in designing new generation reliable microprocessors. Due to their transient/random nature, soft errors cannot be captured by traditional verification and testing process due to the irrelevancy to the correctness of the logic. This dissertation is thus focusing on the reliability characterization and cost-effective reliable design of on-chip memories against soft errors. Due to various performance, area/size, and energy constraints in various target systems, many existing unoptimized protection schemes on cache memories may eventually prove significantly inadequate and ineffective. This work develops new lifetime models for data and tag arrays residing in both the data and instruction caches. These models facilitate the characterization of cache vulnerability of the stored items at various lifetime phases. The design methodology is further exemplified by the proposed reliability schemes targeting at specific vulnerable phases. Benchmarking is carried out to showcase the effectiveness of these approaches. The tag array demands high reliability against soft errors while the data array is fully protected in on-chip caches, because of its crucial importance to the correctness of cache accesses. Exploiting the address locality of memory accesses, this work proposes a Tag Replication Buffer (TRB) to protect information integrity of the tag array in the data cache with low performance, energy and area overheads. To provide a comprehensive evaluation of the tag array reliability, this work also proposes a refined evaluation metric, detected-without-replica-TVF (DOR-TVF), which combines the TVF and access-with-replica (AWR) analysis. Based on the DOR-TVF analysis, a TRB scheme with early write-back (TRB-EWB) is proposed, which achieves a zero DOR-TVF at a negligible performance overhead. Recent research, as well as the proposed optimization schemes in this cache vulnerability study, have focused on the design of cost-effective reliable data caches in terms of performance, energy, and area overheads based on the assumption of fixed error rates. However, for systems in operating environments that vary with time or location, those schemes will be either insufficient or over-designed for the changing error rates. This work explores the design of a self-adaptive reliable data cache that dynamically adapts its employed reliability schemes to the changing operating environments in order to maintain a target reliability. The experimental evaluation shows that the self-adaptive data cache achieves similar reliability to a cache protected by the most reliable scheme, while simultaneously minimizing the performance and power overheads. Besides the data/instruction caches, protecting the register file and its data buses is crucial to reliable computing in high-performance microprocessors. Since the register file is in the critical path of the processor pipeline, any reliable design that increases either the pressure on the register file or the register file access latency is not desirable. This work proposes to exploit narrow-width register values, which represent the majority of generated values, for making the duplicates within the same register data item. A detailed architectural vulnerability factor (AVF) analysis shows that this in-register duplication (IRD) scheme significantly reduces the AVF in the register file compared to the conventional design. The experimental evaluation also shows that IRD provides superior read-with-duplicate (RWD) and error detection/recovery rates under heavy error injection as compared to previous reliability schemes, while only incurring a small power overhead. By integrating the proposed reliable designs in data/instruction caches and register files, the vulnerability of the entire microprocessor is dramatically reduced. The new lifetime model, the self-adaptive design and the narrow-width value duplication scheme proposed in this work can also provide guidance to architects toward highly efficient reliable system design

    Reducing soft errors through operand width aware policies

    Get PDF
    Soft errors are an important challenge in contemporary microprocessors. Particle hits on the components of a processor are expected to create an increasing number of transient errors with each new microprocessor generation. In this paper, we propose simple mechanisms that effectively reduce the vulnerability to soft errors in a processor. Our designs are generally motivated by the fact that many of the produced and consumed values in the processors are narrow and their upper order bits are meaningless. Soft errors caused by any particle strike to these higher order bits can be avoided by simply identifying these narrow values. Alternatively, soft errors can be detected or corrected on the narrow values by replicating the vulnerable portion of the value inside the storage space provided for the upper order bits of these operands. As a faster but less fault tolerant alternative to ECC and parity, we offer a variety of schemes that make use of narrow values and analyze their efficiency in reducing soft error vulnerability of different data-holding components of a processor. On average, techniques that make use of the narrowness of the values can provide 49 percent error detection, 45 percent error correction, or 27 percent error avoidance coverage for single bit upsets in the first level data cache across all Spec2K. In other structures such as the immediate field of the issue queue, an average error detection rate of 64 percent is achieved.Peer ReviewedPostprint (published version

    Control Caching : a fault-tolerant architecture for SEU mitigation in microprocessor control logic

    Get PDF
    The importance of fault tolerance at the processor architecture level has been made increasingly important due to rapid advancements in the design and usage of high performance devices and embedded processors. System level solutions to the challenge of fault tolerance flag errors and utilize penalty cycles to recover through the re-execution of instructions. This motivates the need for a hybrid technique providing fault detection as well as fault masking, with minimal penalty cycles for recovery from detected errors. In this research, we propose Control Caching, an architectural technique comprising of three schemes to protect the control logic of microprocessors against Single Event Upsets (SEUs). High fault coverage with relatively low hardware overhead is obtained by using both fault detection with recovery and fault masking. Control signals are classified as either static or dynamic, and static signals are further classified as opcode dependent and instruction dependent. The strategy for protecting static instruction dependent control signals utilizes a distributed cache of the history of the control bits along with the Triple Modular Redundancy (TMR) concept, while the opcode dependent control signals are protected by a distributed cache which can be used to flag errors. Dynamic signals are protected by selective duplication of datapath components. The techniques are implemented on the OpenRISC 1200 processor. Our simulation results show that fault detection with single cycle recovery is provided for 92% of all instruction executions. FPGA synthesis is performed to analyze the associated cycle time and area overheads

    Exploiting eager register release in a redundantly multi-threaded processor

    Get PDF
    Journal ArticleDue to shrinking transistor sizes and lower supply voltages, transient faults (soft errors) in computer systems are projected to increase by orders of magnitude. Fault detection and recovery can be achieved through redundancy. Redundant multithreading (RMT) is one attractive approach to detect and recover from these errors. However, redundant threads can impose significant performance overheads by competing with the main program for resources such as the register file. In this paper, we propose using eager register release in the main program thread by exploiting the availability of register values in the trailing thread's register space. This performance optimization can help support a smaller register file and potentially reduce register file access time, power consumption, and increase its immunity towards soft errors

    Cross-layer system reliability assessment framework for hardware faults

    Get PDF
    System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored in such an estimation model, the delivered reliability reports must be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.Peer ReviewedPostprint (author's final draft

    Power efficient approaches to redundant multithreading

    Get PDF
    Journal ArticleNoise and radiation-induced soft errors (transient faults) in computer systems have increased significantly over the last few years and are expected to increase even more as we move toward smaller transistor sizes and lower supply voltages. Fault detection and recovery can be achieved through redundancy. The emergence of chip multiprocessors (CMPs) makes it possible to execute redundant threads on a chip and provide relatively low-cost reliability. State-of-the-art implementations execute two copies of the same program as two threads (redundant multithreading), either on the same or on separate processor cores in a CMP, and periodically check results. Although this solution has favorable performance and reliability properties, every redundant instruction flows through a high-frequency complex out-of-order pipeline, thereby incurring a high power consumption penalty. This paper proposes mechanisms that attempt to provide reliability at a modest power and complexity cost. When executing a redundant thread, the trailing thread benefits from the information produced by the leading thread. We take advantage of this property and comprehensively study different strategies to reduce the power overhead of the trailing core in a CMP. These strategies include dynamic frequency scaling, in-order execution, and parallelization of the trailing thread

    Architectures for dependable modern microprocessors

    Get PDF
    Η εξέλιξη των ολοκληρωμένων κυκλωμάτων σε συνδυασμό με τους αυστηρούς χρονικούς περιορισμούς καθιστούν την επαλήθευση της ορθής λειτουργίας των επεξεργαστών μία εξαιρετικά απαιτητική διαδικασία. Με κριτήριο το στάδιο του κύκλου ζωής ενός επεξεργαστή, από την στιγμή κατασκευής των πρωτοτύπων και έπειτα, οι τεχνικές ελέγχου ορθής λειτουργίας διακρίνονται στις ακόλουθες κατηγορίες: (1) Silicon Debug: Τα πρωτότυπα ολοκληρωμένα κυκλώματα ελέγχονται εξονυχιστικά, (2) Manufacturing Testing: ο τελικό ποιοτικός έλεγχος και (3) In-field verification: Περιλαμβάνει τεχνικές, οι οποίες διασφαλίζουν την λειτουργία του επεξεργαστή σύμφωνα με τις προδιαγραφές του. Η διδακτορική διατριβή προτείνει τα ακόλουθα: (1) Silicon Debug: Η εργασία αποσκοπεί στην επιτάχυνση της διαδικασίας ανίχνευσης σφαλμάτων και στον αυτόματο εντοπισμό τυχαίων προγραμμάτων που δεν περιέχουν νέα -χρήσιμη- πληροφορία σχετικά με την αίτια ενός σφάλματος. Η κεντρική ιδέα αυτής της μεθόδου έγκειται στην αξιοποίηση της έμφυτης ποικιλομορφίας των αρχιτεκτονικών συνόλου εντολών και στην δυνατότητα από-διαμόρφωσης τμημάτων του κυκλώματος, (2) Manufacturing Testing: προτείνεται μία μέθοδο για την βελτιστοποίηση του έλεγχου ορθής λειτουργίας των πολυνηματικών και πολυπύρηνων επεξεργαστών μέσω της χρήση λογισμικού αυτοδοκιμής, (3) Ιn-field verification: Αναλύθηκε σε βάθος η επίδραση που έχουν τα μόνιμα σφάλματα σε μηχανισμούς αύξησης της απόδοσης. Επιπρόσθετα, προτάθηκαν τεχνικές για την ανίχνευση και ανοχή μόνιμων σφαλμάτων υλικού σε μηχανισμούς πρόβλεψης διακλάδωσης.Technology scaling, extreme chip integration and the compelling requirement to diminish the time-to-market window, has rendered microprocessors more prone to design bugs and hardware faults. Microprocessor validation is grouped into the following categories, based on where they intervene in a microprocessor’s lifecycle: (a) Silicon debug: the first hardware prototypes are exhaustively validated, (b) Μanufacturing testing: the final quality control during massive production, and (c) In-field verification: runtime error detection techniques to guarantee correct operation. The contributions of this thesis are the following: (1) Silicon debug: We propose the employment of deconfigurable microprocessor architectures along with a technique to generate self-checking random test programs to avoid the simulation step and triage the redundant debug sessions, (2) Manufacturing testing: We propose a self-test optimization strategy for multithreaded, multicore microprocessors to speedup test program execution time and enhance the fault coverage of hard errors; and (3) In-field verification: We measure the effect of permanent faults performance components. Then, we propose a set of low-cost mechanisms for the detection, diagnosis and performance recovery in the front-end speculative structures. This thesis introduces various novel methodologies to address the validation challenges posed throughout the life-cycle of a chip

    Parallel error detection using heterogeneous cores

    Get PDF
    Microprocessor error detection is increasingly important, as the number of transistors in modern systems heightens their vulnerability. In addition, many modern workloads in domains such as the automotive and health industries are increasingly error intolerant, due to strict safety standards. However, current detection techniques require duplication of all hardware structures, causing a considerable increase in power consumption and chip area. Solutions in the literature involve running the code multiple times on the same hardware, which reduces performance significantly and cannot capture all errors. We have designed a novel hardware-only solution for error detection, that exploits parallelism in checking code which may not exist in the original execution. We pair a high-performance out-of-order core with a set of small low-power cores, each of which checks a portion of the out-of-order core's execution. Our system enables the detection of both hard and soft errors, with low area, power and performance overheads.This work was supported by the Engineering and Physical Sciences Research Council (EPSRC), through grant references EP/K026399/1 and EP/M506485/1, and Arm Ltd

    Approaches to multiprocessor error recovery using an on-chip interconnect subsystem

    Get PDF
    For future multicores, a dedicated interconnect subsystem for on-chip monitors was found to be highly beneficial in terms of scalability, performance and area. In this thesis, such a monitor network (MNoC) is used for multicores to support selective error identification and recovery and maintain target chip reliability in the context of dynamic voltage and frequency scaling (DVFS). A selective shared memory multiprocessor recovery is performed using MNoC in which, when an error is detected, only the group of processors sharing an application with the affected processors are recovered. Although the use of DVFS in contemporary multicores provides significant protection from unpredictable thermal events, a potential side effect can be an increased processor exposure to soft errors. To address this issue, a flexible fault prevention and recovery mechanism has been developed to selectively enable a small amount of per-core dual modular redundancy (DMR) in response to increased vulnerability, as measured by the processor architectural vulnerability factor (AVF). Our new algorithm for DMR deployment aims to provide a stable effective soft error rate (SER) by using DMR in response to DVFS caused by thermal events. The algorithm is implemented in real-time on the multicore using MNoC and controller which evaluates thermal information and multicore performance statistics in addition to error information. DVFS experiments with a multicore simulator using standard benchmarks show an average 6% improvement in overall power consumption and a stable SER by using selective DMR versus continuous DMR deployment
    corecore