112 research outputs found

    Microprocessor error diagnosis by trace monitoring under laser testing

    This work explores the diagnosis capabilities offered by the enriched information of a microprocessor's trace subsystem combined with laser fault injection. Laser fault injection campaigns targeting delimited architectural regions have been carried out on an ARM Cortex-A9 device. Experimental results demonstrate the capability of the presented technique to provide additional information about the various error mechanisms that can occur in a microprocessor. A comparison with radiation campaigns presented in previous work is also discussed, showing that laser fault injection results are in good agreement with neutron and proton radiation results.

    Microarchitecture-level reliability assessment of multi-bit upsets in processors

    The continuing decrease in feature sizes of modern Integrated Circuits (ICs) makes reliability and vulnerability assessments of the core ever more important in the early, pre-silicon stages of the design. As technology nodes shrink, radiation effects play a bigger role, leading to more severe effects in the devices and increased numbers of multi-bit faults. It is therefore crucial to evaluate each design with common fault injection mechanisms using microarchitectural simulators, which provide flexibility and improved simulation speed compared to RTL (Register Transfer Level) designs. This thesis focuses on multi-bit faults, showing their effects on different components of a microarchitectural model of the ARM Cortex-A9 core implemented on the Gem5 simulator. GeFIN (Gem5-based Fault INjector) is used for the fault injection campaigns, extended with an improved fault mask generation tool that creates fault masks with particular characteristics. The improved generator supports the injection of multi-bit faults in adjacent areas of a structure, a case very common in real environments, as well as fault injection in interleaved memories, a widely used technique to mitigate the effects of multiple-bit upsets. The results of this study show that some components of the core under test (e.g. the Instruction Translation Lookaside Buffer) are highly vulnerable to fault injection, with rates as low as 25% correct executions over 1000 experiments, while others, such as the Level 1 Data/Instruction Caches and the Level 2 Cache, are more sensitive to the increasing number of injected faults, with a variation as high as 24% between single- and triple-bit fault injection for the L1 Data Cache. These numbers relate to the "theoretical" Architectural Vulnerability Factor (AVF), which is independent of the fabrication technology node. The calculation was extended to compute the AVF for each technology node from 250 nm to 22 nm, showing increasing AVF rates as the node shrinks. Lastly, a reliability assessment using the Failures in Time (FIT) metric showed the highest numbers for the Level 2 Cache, primarily because of its size (4 Mbit), with a FIT of 822.9 at the 130 nm node. The FIT of the core peaked at 918 at the same node, while for nodes smaller than 130 nm the FIT decreased, primarily because of the decrease in the raw FIT factor of each technology.
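    As a rough illustration of two mechanisms this abstract describes (adjacent multi-bit fault masks mapped through interleaving, and FIT scaling by structure size and AVF), the sketch below shows how such a generator and calculation might look. All names and numeric values are hypothetical, not GeFIN's actual code; only the standard FIT composition (raw FIT per Mbit, scaled by size and AVF) is assumed.

    import random

    def adjacent_fault_mask(rows, cols, cluster_size):
        # Pick a random bit and flip `cluster_size` physically adjacent
        # bits in the same row, modelling a multi-bit upset.
        r = random.randrange(rows)
        c = random.randrange(cols - cluster_size + 1)
        return [(r, c + i) for i in range(cluster_size)]

    def logical_positions(mask, interleave):
        # With `interleave`-way column interleaving, physically adjacent
        # bits belong to different logical words, so one multi-bit upset
        # becomes several single-bit errors that SEC-DED ECC can correct.
        return [(r, c // interleave, c % interleave) for (r, c) in mask]

    def structure_fit(raw_fit_per_mbit, size_mbit, avf):
        # FIT of a structure: raw per-bit FIT of the node, scaled by
        # structure size and its Architectural Vulnerability Factor.
        return raw_fit_per_mbit * size_mbit * avf

    mask = adjacent_fault_mask(rows=1024, cols=64, cluster_size=3)
    print(logical_positions(mask, interleave=4))
    print(structure_fit(1000.0, 4.0, 0.20))  # hypothetical values -> 800.0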

    The Use of Microprocessor Trace Infrastructures for Radiation-Induced Fault Diagnosis

    This work proposes a methodology to diagnose radiation-induced faults in a microprocessor using the hardware trace infrastructure. The diagnosis capabilities of this approach are demonstrated for an ARM microprocessor under neutron and proton irradiation campaigns. The experimental results demonstrate that the execution status at the precise moment the error occurred can be reconstructed, so that error diagnosis can be achieved.

    Reduction Of Architecture Vulnerability Factor Using Modified Razor Flipflops

    Research has shown that microprocessors and their internal structures are vulnerable to alpha-particle-induced Single Event Upsets that affect program correctness and reliability. In this thesis, we have explored the use of Modified Razor flip-flops in the microprocessor to increase its overall reliability. We have adopted Architecturally Correct Execution (ACE) time-based techniques to measure the Architecture Vulnerability Factor (AVF) of high-performance microprocessors and their internal structures using the SPEC 2000 integer benchmarks. We have computed the reduction in AVF with the introduction of Modified Razor flip-flops for various combinations of bit-fields that have high vulnerability. However, introducing Modified Razor flip-flops results in a higher area requirement on the die and higher power consumption. We have identified the most cost-effective solution by identifying the fields of these microarchitectural structures - where Modified Razor flip-flops are introduced - that yield the highest percentage decrease in AVF per unit area-power product.
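    For readers unfamiliar with ACE analysis, the underlying calculation (the standard Mukherjee-style formulation, not this thesis's exact tooling) reduces to a fraction of bit-cycles; the numbers below are made up for illustration.

    def avf(ace_bit_cycles, total_bits, total_cycles):
        # AVF = ACE bit-cycles / (bits in structure * execution cycles).
        # A bit-cycle is ACE if flipping that bit in that cycle could
        # change the program's architecturally visible outcome.
        return ace_bit_cycles / (total_bits * total_cycles)

    # e.g. a 64-entry, 64-bit structure over 1e9 cycles,
    # with 4.1e11 of its bit-cycles classified as ACE:
    print(avf(4.1e11, 64 * 64, 1e9))  # -> ~0.10, i.e. 10% vulnerable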

    Soft-error resilient on-chip memory structures

    Soft errors induced by energetic particle strikes in on-chip memory structures, such as L1 data/instruction caches and register files, have become an increasing challenge in designing new-generation reliable microprocessors. Due to their transient and random nature, soft errors cannot be captured by the traditional verification and testing process, since they are unrelated to the correctness of the logic design. This dissertation therefore focuses on the reliability characterization and cost-effective reliable design of on-chip memories against soft errors. Under the performance, area, and energy constraints of various target systems, many existing unoptimized protection schemes for cache memories may eventually prove significantly inadequate and ineffective. This work develops new lifetime models for the data and tag arrays residing in both the data and instruction caches. These models facilitate the characterization of the vulnerability of stored items at various lifetime phases. The design methodology is further exemplified by the proposed reliability schemes targeting specific vulnerable phases, and benchmarking is carried out to showcase the effectiveness of these approaches. Because of its crucial importance to the correctness of cache accesses, the tag array demands high reliability against soft errors even when the data array is fully protected in on-chip caches. Exploiting the address locality of memory accesses, this work proposes a Tag Replication Buffer (TRB) to protect the information integrity of the tag array in the data cache with low performance, energy, and area overheads. To provide a comprehensive evaluation of tag array reliability, this work also proposes a refined evaluation metric, detected-without-replica-TVF (DOR-TVF), which combines the TVF and access-with-replica (AWR) analysis. Based on the DOR-TVF analysis, a TRB scheme with early write-back (TRB-EWB) is proposed, which achieves a zero DOR-TVF at a negligible performance overhead. Recent research, as well as the optimization schemes proposed in this cache vulnerability study, has focused on the design of cost-effective reliable data caches in terms of performance, energy, and area overheads under the assumption of fixed error rates. However, for systems in operating environments that vary with time or location, those schemes will be either insufficient or over-designed for the changing error rates. This work explores the design of a self-adaptive reliable data cache that dynamically adapts its employed reliability schemes to the changing operating environment in order to maintain a target reliability. The experimental evaluation shows that the self-adaptive data cache achieves reliability similar to a cache protected by the most reliable scheme, while simultaneously minimizing the performance and power overheads. Besides the data/instruction caches, protecting the register file and its data buses is crucial to reliable computing in high-performance microprocessors. Since the register file is on the critical path of the processor pipeline, any reliable design that increases either the pressure on the register file or its access latency is undesirable. This work proposes to exploit narrow-width register values, which represent the majority of generated values, to place a duplicate within the same register data item. A detailed architectural vulnerability factor (AVF) analysis shows that this in-register duplication (IRD) scheme significantly reduces the AVF of the register file compared to the conventional design. The experimental evaluation also shows that IRD provides superior read-with-duplicate (RWD) and error detection/recovery rates under heavy error injection compared to previous reliability schemes, while incurring only a small power overhead. By integrating the proposed reliable designs into the data/instruction caches and register files, the vulnerability of the entire microprocessor is dramatically reduced. The new lifetime model, the self-adaptive design, and the narrow-width value duplication scheme proposed in this work can also provide guidance to architects toward highly efficient reliable system design.
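    A minimal sketch of the in-register duplication idea follows, assuming a 64-bit register whose unused upper half stores the duplicate of a narrow value; the dissertation's actual detection and recovery policy may differ.

    REG_BITS = 64
    HALF = REG_BITS // 2
    HALF_MASK = (1 << HALF) - 1

    def write_reg(value):
        # If the value fits in the lower half of the register, store a
        # duplicate copy in the otherwise-unused upper half.
        if value <= HALF_MASK:               # narrow-width value
            return (value << HALF) | value   # upper half = duplicate
        return value                         # wide value: no duplicate

    def read_reg(stored, was_narrow):
        # On read, compare the two copies; a mismatch signals a soft
        # error (detection only in this sketch; recovery would need
        # additional state to decide which copy is corrupted).
        if not was_narrow:
            return stored, True
        lo, hi = stored & HALF_MASK, stored >> HALF
        return lo, lo == hi                  # (value, no_error_detected)

    # Flip a bit in the lower copy and observe the mismatch:
    stored = write_reg(0x1234)
    corrupted = stored ^ 0x1
    value, ok = read_reg(corrupted, was_narrow=True)
    print(hex(value), ok)  # -> 0x1235 False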

    Cross-layer Soft Error Analysis and Mitigation at Nanoscale Technologies

    This thesis addresses the challenge of soft error modeling and mitigation at nanoscale technology nodes and pushes the state of the art forward by proposing novel modeling, analysis, and mitigation techniques. The proposed soft error sensitivity analysis platform accurately models both error generation and propagation, starting from technology-dependent device-level simulations all the way to workload-dependent application-level analysis.

    Energy-Aware Data Movement In Non-Volatile Memory Hierarchies

    While technology scaling enables increased density for memory cells, the intrinsic high leakage power of conventional CMOS technology and the demand for reduced energy consumption inspire the use of emerging technology alternatives such as eDRAM and Non-Volatile Memory (NVM), including STT-MRAM, PCM, and RRAM. The use of emerging technologies in Last Level Cache (LLC) designs, which occupy a significant fraction of total die area in Chip Multi-Processors (CMPs), introduces new dimensions of vulnerability, energy consumption, and performance delivery. To be specific, a part of this research focuses on the eDRAM Bit Upset Vulnerability Factor (BUVF) to assess the vulnerable portion of the eDRAM refresh cycle, where the critical charge varies depending on the write voltage and the storage and bit-line capacitance. This dissertation broadens the study of LLC vulnerability assessment by investigating the impact of Process Variations (PV) on the narrow resistive sensing margins in high-density NVM arrays, including on-chip cache and primary memory. Large-latency and power-hungry Sense Amplifiers (SAs) have been adopted to combat PV in the past. Herein, a novel approach is proposed to leverage the PV in NVM arrays using a Self-Organized Sub-bank (SOS) design. SOS engages the preferred SA alternative based on the intrinsic as-built behavior of the resistive sensing timing margin to reduce latency and power consumption while maintaining an acceptable access time. In addition, this dissertation investigates a novel technique to prioritize service to 1) Extensive Read Reused Accessed (ERRA) blocks of the LLC that are silently dropped from higher levels of cache, and 2) the portion of the working set that may exhibit a distant re-reference interval in L2. In particular, we develop a lightweight Multi-level Access History Profiler to efficiently identify ERRA blocks by aggregating the LLC block addresses tagged with identical Most Significant Bits into a single entry (see the sketch below). Experimental results indicate that the proposed technique can reduce the L2 read miss ratio by 51.7% on average across PARSEC and SPEC2006 workloads. Furthermore, this dissertation broadens and applies advancements in theories of subspace recovery to pioneer computationally-aware in-situ operand reconstruction via the novel Logic In Interconnect (LI2) scheme. LI2 is developed, validated, and refined both theoretically and experimentally to realize a radically different approach to post-Moore's Law computing: by leveraging the features of low-rank matrices, it reconstructs data instead of fetching it from main memory, reducing the energy/latency cost per data movement. We propose the LI2 enhancement to attain high performance delivery in the post-Moore's Law era by equipping the contemporary microarchitecture design with a customized memory controller that orchestrates memory requests, fetching low-rank matrices to a customized Fine Grain Reconfigurable Accelerator (FGRA) for reconstruction while the other memory requests are serviced as before. The goal of LI2 is to conquer the high latency/energy required to traverse main memory arrays in the case of an LLC miss, using in-situ construction of the requested data when dealing with low-rank matrices. Thus, LI2 exchanges a high volume of data transfers for a novel lightweight reconstruction method under specific conditions, using a cross-layer hardware/algorithm approach.
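    The profiler's aggregation step can be sketched as follows; the address width, MSB count, and reuse threshold are assumptions chosen for illustration, not the dissertation's actual parameters.

    from collections import defaultdict

    ADDR_BITS = 48     # assumed physical address width
    MSB_BITS = 20      # assumed: aggregate by the 20 most significant bits

    def msb_tag(addr):
        # Collapse an address to its most significant bits so that
        # spatially close blocks share a single profiler entry.
        return addr >> (ADDR_BITS - MSB_BITS)

    reuse_count = defaultdict(int)

    def record_llc_access(addr):
        # Accesses whose addresses share the same MSBs land in one
        # entry; entries with high counts flag candidate ERRA regions.
        reuse_count[msb_tag(addr)] += 1

    def is_erra(addr, threshold=8):
        return reuse_count[msb_tag(addr)] >= threshold

    # Two addresses in the same 256 MB region (2^28 offset bits) share an entry:
    record_llc_access(0x7F0012340000)
    record_llc_access(0x7F001FFF0040)
    print(is_erra(0x7F0010000000, threshold=2))  # -> True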

    Soft Error Effects on Arm Microprocessors: Early Estimations versus Chip Measurements

    Extensive research efforts are being carried out to evaluate and improve the reliability of computing devices, either through beam experiments or through simulation-based fault injection. Unfortunately, it is still largely unclear to what extent fault injection can provide an accurate error rate estimation at early stages, and whether beam experiments can be used to identify the weakest resources in a device. The importance of, and challenges associated with, a timely yet realistic reliability evaluation grow with the increase of complexity in both the hardware domain, with the integration of different types of cores in an SoC (System-on-Chip), and the software domain, with the OS (operating system) required to take full advantage of the available resources. In this paper, we combine and analyze data gathered with extensive beam experiments (on the final physical CPU hardware) and microarchitectural fault injections (on early microarchitectural CPU models). We target a standalone Arm Cortex-A5 CPU and an Arm Cortex-A9 CPU integrated into an SoC, and evaluate their reliability in bare-metal and Linux-based configurations. Combining experimental data that covers more than 18 million years of device time with the results of more than 176,000 injections, we find that both the SoC integration and the presence of the OS increase the system DUE (Detected Unrecoverable Error) rate (for different reasons) but do not significantly impact the SDC (Silent Data Corruption) rate, which is solely attributed to the CPU core. Our reliability analysis demonstrates that, even considering SoC integration and OS inclusion, early, pre-silicon microarchitecture-level fault injection delivers accurate SDC rate estimations and lower bounds for the DUE rates.

    SyRA: early system reliability analysis for cross-layer soft errors resilience in memory arrays of microprocessor systems

    Cross-layer reliability is becoming the preferred solution when reliability is a concern in the design of a microprocessor-based system. Nevertheless, deciding how to distribute the error management across the different layers of the system is a very complex task that requires the support of dedicated frameworks for cross-layer reliability analysis. This paper proposes SyRA, a system-level cross-layer early reliability analysis framework for radiation-induced soft errors in memory arrays of microprocessor-based systems. The framework exploits a multi-level hybrid Bayesian model to describe the target system and takes advantage of Bayesian inference to estimate different reliability metrics. SyRA implements several mechanisms and features to deal with the complexity of realistic models and provides a complete tool-chain that scales efficiently with the complexity of the system. The simulation time is significantly lower than that of microarchitecture-level or RTL fault-injection experiments, with an accuracy high enough to take effective design decisions. To demonstrate the capability of SyRA, we analyzed the reliability of a set of microprocessor-based systems characterized by different microprocessor architectures (i.e., Intel x86, ARM Cortex-A15, ARM Cortex-A9) running either the Linux operating system or bare metal. Each system under analysis executes different software workloads, both from benchmark suites and from real applications.
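    As a toy illustration of Bayesian reliability estimation (emphatically not SyRA's multi-level hybrid model), a Beta-Binomial posterior per component, combined under an independence assumption, looks like this; all priors and counts are hypothetical.

    from math import prod

    def posterior_fail_prob(failures, trials, a=1.0, b=1.0):
        # Posterior mean of a component's failure probability from
        # fault injection outcomes, with a Beta(a, b) prior
        # (a = b = 1 gives a uniform prior).
        return (a + failures) / (a + b + trials)

    def system_fail_prob(component_probs):
        # System fails if any component fails (independence assumed
        # purely for this sketch).
        return 1.0 - prod(1.0 - p for p in component_probs)

    comps = [posterior_fail_prob(f, n) for f, n in [(3, 1000), (12, 1000)]]
    print(system_fail_prob(comps))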

    REFU: Redundant Execution with Idle Functional Units, Fault Tolerant GPGPU architecture

    General-Purpose Graphics Processing Units (GPGPUs) are increasingly used in a wide range of applications due to their high performance and energy-efficient execution. These GPGPUs are fabricated with cutting-edge technologies. Shrinking transistor feature sizes and aggressive voltage scaling have increased the susceptibility of devices to intrinsic and extrinsic noise, leading to major reliability issues in the form of transient faults. It is therefore essential to ensure the reliable operation of GPGPUs in the presence of transient faults. GPGPUs are designed for high throughput and execute multiple threads in parallel, which brings a new challenge for fault detection with minimum overhead across all threads. This paper proposes a new fault detection method called REFU, an architectural solution that detects transient faults by temporally redundant re-execution of instructions on the idle functional execution units of the GPGPU. The performance of REFU is evaluated with standard benchmarks; for fault-free runs across different workloads, REFU shows a mean performance overhead of 2%, an average power overhead of 6%, and a peak power overhead of 10%.
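    The core REFU idea, re-executing an instruction on an idle functional unit and comparing the two results, can be abstracted behaviorally as below; this is an illustrative sketch under assumed names, not the paper's microarchitecture.

    def execute(op, a, b):
        return {"add": a + b, "mul": a * b}[op]

    def issue_with_refu(op, a, b, idle_unit_available, inject_fault=False):
        # Primary execution always proceeds; if a functional unit is
        # idle this cycle, the instruction is re-executed on it and the
        # two results are compared to detect a transient fault.
        primary = execute(op, a, b)
        if inject_fault:
            primary ^= 1                  # model a single-bit flip
        if idle_unit_available:
            shadow = execute(op, a, b)    # temporally redundant copy
            if primary != shadow:
                return None               # fault detected: squash and replay
        return primary

    print(issue_with_refu("add", 2, 3, idle_unit_available=True,
                          inject_fault=True))  # -> None (fault caught)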